Empirical Risk Minimization of AUROC 


A Preprint 


Victor W. Rielly Department of Mathematics 

Portland State University 
Portland, OR 97213 
victor23@pdx.edu 


February 26, 2020 

Abstract 

For unbalanced binary classification problems, or problems requiring a trade-off between true positive
and false positive rates, machine learning models are often evaluated using AUROC (Area Under
the Receiver Operating Characteristic Curve), or just AUC for short. Many papers have been written
in an attempt to create machine learning models that optimize directly for AUROC. We present a
review of the current literature on the subject, some new results and interpretations of the task, as
well as some experimental results.

Keywords AUROC • Empirical Risk Minimization • Machine Learning • Statistics


1 Introduction 

Machine learning models developed for binary classification problems with unbalanced data sets are usually
evaluated using AUROC, which provides a measure of how well the model ranks the data instances with respect to the
expected binary labels. In cases involving unbalanced data, accuracy does not provide a stable measure of a machine
learning model because even trivial models give good accuracy. For instance, in a data set with 100 negative instances
for every positive instance, a trivial classifier that classifies everything as negative will be very accurate but may rank
the data set very poorly.

AUROC is also often used to evaluate machine learning models in the medical field, where a scientist may be more
interested in how confident a model is of a certain classification than in whether the classification is correct or incorrect.
For this reason, many efforts have been made to develop a framework to train models to optimize AUROC in place of
its surrogate, accuracy.

We present a review of what is known about the generalization, uniform convergence, and consistency of machine 
learning models for AUROC optimization, as well as some new observations we have made on this topic. 


2 Theory: 

2.1 Generalization Bound: 


Before we set out to develop models to optimize AUROC, we should develop some theoretical framework to justify
optimizing for AUROC. To this effect, we begin with a generalization bound for AUROC. Let $A'(f;T)$ be the empirical
AUROC of a function $f : S \to \mathbb{R}$ on a test set $T$, where each $t \in T$ has the form $t = (x, y)$ with $x \in S$, where $S$ is the sample
space, and $y \in \{-1, 1\}$. The AUROC is given by:


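$$A'(f;T) = \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \left[ \mathbb{I}_{\{f(x_i) > f(x_j)\}} + \frac{1}{2}\, \mathbb{I}_{\{f(x_i) = f(x_j)\}} \right] \tag{1}$$

where

$$m = \sum_{\{i : y_i = +1\}} 1 \quad \text{and} \quad n = \sum_{\{i : y_i = -1\}} 1.$$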


From the equation for $A'(f;T)$ we see that the AUROC of $f$ with respect to $T$ is just the fraction of positive-negative
pairs in $T$ that are ranked correctly by $f$, assuming that ties are broken uniformly at random [1], (Cortes and Mohri,
2004). It is common to also define empirical AUROC by

$$A(f;T) = \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{I}_{\{f(x_i) \ge f(x_j)\}} \tag{2}$$


in which case ties are considered wins. Also, let $A'(f)$, the expected ranking accuracy of the function $f$, be defined
by:

$$A'(f) = \mathbf{E}_{X \sim D_+,\, X' \sim D_-} \left\{ \mathbb{I}_{\{f(X) > f(X')\}} + \frac{1}{2}\, \mathbb{I}_{\{f(X) = f(X')\}} \right\} \tag{3}$$

where $X$ is sampled from $D_+$, the underlying distribution of positive instances of our data, and $X'$ is sampled from
$D_-$, the underlying distribution of negative instances of our data. We desire a result of the form:

$$P\{ |A'(f;T) - A'(f)| > \epsilon \} < \delta$$

That is to say, the probability that the expected ranking accuracy of a function $f$ differs from the observed empirical
AUROC on the test set by more than $\epsilon$ is smaller than $\delta$.
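
As a concrete illustration of the two empirical definitions in equations 1 and 2, the following is a minimal sketch in plain Python. The function name `empirical_auroc` and the `ties` argument are illustrative choices, not part of the paper or of [1]; labels are assumed to be +1 and -1 as in the text.

```python
def empirical_auroc(scores, labels, ties="half"):
    """Fraction of positive-negative pairs ranked correctly.

    ties="half": a tied pair counts 1/2 (equation 1).
    ties="win":  a tied pair counts 1   (equation 2).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    tie_weight = {"half": 0.5, "win": 1.0}[ties]
    total = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                total += 1.0          # correctly ranked pair
            elif sp == sn:
                total += tie_weight   # tied pair
    return total / (len(pos) * len(neg))
```

For scores `[0.9, 0.4, 0.4, 0.1]` with labels `[1, 1, -1, -1]`, three of the four pairs are strictly ordered and one is tied, so the first convention gives 0.875 and the second gives 1.0. The double loop is quadratic in the number of instances; it is meant to mirror the formulas, not to be efficient.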

This kind of result helps us in the following hypothetical situation. If we develop a model to distinguish poisonous
mushrooms from edible mushrooms, we can test the model on a new test set of 10000 mushrooms. Let's say we
observe an empirical AUROC of .9 on this test set. Then we can tell our significant other that there is a probability of less
than $\delta$ that the model we have created will have an AUROC that is either greater than $.9 + \epsilon$ or less than $.9 - \epsilon$.

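[1] gives us our desired generalization bound. In [1] we see that

$$P\left\{ |A'(f;T) - A'(f)| > \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{4}$$

For the one sided bounds we will see

$$P\left\{ A'(f;T) - A'(f) > \sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{5}$$

and

$$P\left\{ A'(f;T) - A'(f) < -\sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{6}$$

where $N$ is the size of the test set, and $p(T_y)$ is the proportion of positive instances in our test set. Refer to Figure 1 for
graphs of the qualitative behavior of this bound.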

Typically, we wish to have $\delta$ written as a function of $\epsilon$, but the dependence on the observed ratio of positives to negatives
on the test set precludes us from solving the inequality for $\delta$ in terms of $\epsilon$. This is not exactly the format of the bounds
we were looking for, but in practice this works just as well. One may use an a priori estimate of $p(T_y)$ to figure
out how large an $N$ is needed to be within any desired $\epsilon$ neighborhood of the empirical AUC. The proof of this
is a relatively direct application of McDiarmid's inequality, which was proven in 1989. McDiarmid's inequality is as
follows:


Theorem 1 Let $X_1, X_2, X_3, \dots, X_N$ be independent random variables, with $X_k$ taking values in a set $A_k$ for each $k$.
Let $\phi : (A_1 \times \dots \times A_N) \to \mathbb{R}$ be such that

$$\sup_{x_i \in A_i,\, x'_k \in A_k} |\phi(x_1, \dots, x_N) - \phi(x_1, \dots, x_{k-1}, x'_k, x_{k+1}, \dots, x_N)| \le c_k$$

Then for any $\epsilon > 0$,

$$P\left( |\phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\}| \ge \epsilon \right) \le 2 e^{-2\epsilon^2 / \sum_{k=1}^{N} c_k^2}$$


Figure 1: Generalization Bounds. (a) Probability delta as a function of N for a fixed interval size epsilon (DeltaN.png). (b) Interval size epsilon as a function of N for a fixed probability delta of .05 (EpsilonN.png).


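In particular we have the following pair of one sided inequalities

$$P\left( \phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\} \ge \epsilon \right) \le e^{-2\epsilon^2 / \sum_{k=1}^{N} c_k^2}$$

$$P\left( \phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\} \le -\epsilon \right) \le e^{-2\epsilon^2 / \sum_{k=1}^{N} c_k^2}$$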


Before we proceed with the generalization result, let us build for ourselves a better understanding of McDiarmid's
inequality by applying it to a real world example. McDiarmid's inequality was used to prove the generalization bound
for accuracy, and we will use the same approach for AUROC. Suppose that, for some fixed $f$, $\phi$ is the average accuracy of $f$ on $N$ sample points:

$$\phi(x_1, \dots, x_N) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}_{\{f \text{ classifies } x_i \text{ correctly}\}}$$

Notice that

$$\sup_{x_i \in A_i,\, x'_k \in A_k} |\phi(x_1, \dots, x_N) - \phi(x_1, \dots, x_{k-1}, x'_k, x_{k+1}, \dots, x_N)| \le \frac{1}{N}.$$

That is, if we change only one of the input elements of $\phi$, our average accuracy changes by at most $\frac{1}{N}$. Our metric
$\phi$ is referred to as $\beta$ stable with $\beta = \frac{1}{N}$. McDiarmid's 2 sided inequality would give us directly

$$P\left( |\phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\}| \ge \epsilon \right) \le 2 e^{-2\epsilon^2 / \sum_{k=1}^{N} \beta^2} = 2 e^{-2\epsilon^2 N}$$

since $\sum_{k=1}^{N} \beta^2 = N \cdot \frac{1}{N^2} = \frac{1}{N}$. McDiarmid's one sided inequalities would each give us

$$P\left( \phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\} \ge \epsilon \right) \le e^{-2\epsilon^2 N}$$

and

$$P\left( \phi(X_1, \dots, X_N) - \mathbf{E}\{\phi(X_1, \dots, X_N)\} \le -\epsilon \right) \le e^{-2\epsilon^2 N}$$

This is telling us that as $N$ gets large, we can expect $\phi(x_1, \dots, x_N)$ to be approximately $\mathbf{E}\{\phi(X_1, \dots, X_N)\}$, because $\phi$
is stable in the sense that it does not favor any of its inputs over any other. In our case, we wish to apply McDiarmid's
inequality to $\phi = A'(f;T)$, so we will need some added subtlety.
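
To make the accuracy example above concrete, here is a small sketch that evaluates the two sided McDiarmid bound for an arbitrary list of sensitivities $c_k$ and specializes it to accuracy, where every $c_k = 1/N$. The function names are hypothetical and chosen only for this illustration.

```python
import math

def mcdiarmid_two_sided(eps, c):
    """Two sided McDiarmid bound 2 * exp(-2 eps^2 / sum_k c_k^2)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 / sum(ck ** 2 for ck in c))

def accuracy_bound(eps, n):
    """Accuracy case: changing one of n points moves the average by at most 1/n."""
    return mcdiarmid_two_sided(eps, [1.0 / n] * n)

# The general formula collapses to 2 * exp(-2 eps^2 N) for accuracy.
closed_form = 2.0 * math.exp(-2.0 * 0.05 ** 2 * 1000)
```

Calling `accuracy_bound(0.05, 1000)` agrees with `closed_form` up to floating point error, which is the simplification $\sum_k \beta^2 = 1/N$ used in the text.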


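Theorem 2 Let $f : S \to \mathbb{R}$ be a fixed ranking function on $S$ and let $p(T_y)$ be the proportion of positive instances in
the test set. Then for any $\delta > 0$

$$P\left\{ |A'(f;T) - A'(f)| > \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{7}$$

The proof of this may be found in [1].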




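Proof: Given the sequence of labels $y' = (y_1, \dots, y_N) \in Y^N$, the random variables $X_1, \dots, X_N$ are independent,
with each $X_k$ taking values in $S$. Let's define a function $\phi : S^N \to \mathbb{R}$ as follows:

$$\phi(x_1, \dots, x_N) = A'(f; ((x_1, y_1), \dots, (x_N, y_N))).$$

Then for each $k$ such that $y_k = 1$, and for $m$, $n$ the number of positive and negative instances in $y'$ respectively, we
have for all $x_k, x'_k \in S$:

$$|\phi(x_1, \dots, x_N) - \phi(x_1, \dots, x_{k-1}, x'_k, x_{k+1}, \dots, x_N)| = \frac{1}{mn} \left| \sum_{\{j : y_j = -1\}} \left[ \left( \mathbb{I}_{\{f(x_k) > f(x_j)\}} + \frac{1}{2} \mathbb{I}_{\{f(x_k) = f(x_j)\}} \right) - \left( \mathbb{I}_{\{f(x'_k) > f(x_j)\}} + \frac{1}{2} \mathbb{I}_{\{f(x'_k) = f(x_j)\}} \right) \right] \right| \le \frac{n}{mn} = \frac{1}{m}$$

Similarly, for each $k$ such that $y_k = -1$ we have for all $x_k, x'_k \in S$,

$$|\phi(x_1, \dots, x_N) - \phi(x_1, \dots, x_{k-1}, x'_k, x_{k+1}, \dots, x_N)| \le \frac{1}{n}$$

Therefore, the 2-sided McDiarmid's inequality tells us

$$P_{T_X | T_Y = y'}\{ |A'(f;T) - A'(f)| \ge \epsilon \} \le 2 e^{-2\epsilon^2 / \sum_{k=1}^{N} c_k^2}$$

where $c_k = \frac{1}{m}$ if $y_k = 1$ and $c_k = \frac{1}{n}$ if $y_k = -1$. Now,

$$\sum_{k=1}^{N} c_k^2 = m \left( \frac{1}{m} \right)^2 + n \left( \frac{1}{n} \right)^2 = \frac{1}{m} + \frac{1}{n} = \frac{m + n}{mn}$$

so, writing $p = m/N$ for the proportion of positive labels in $y'$ (which gives $mn/(m+n) = p(1-p)N$),

$$P_{T_X | T_Y = y'}\{ |A'(f;T) - A'(f)| \ge \epsilon \} \le 2 e^{-2\epsilon^2 mn/(m+n)} = 2 e^{-2\epsilon^2 p(1-p) N} \tag{8}$$

Similarly, the one sided McDiarmid's inequalities each give us

$$P_{T_X | T_Y = y'}\{ A'(f;T) - A'(f) \ge \epsilon \} \le e^{-2\epsilon^2 p(1-p) N} \tag{9}$$

and

$$P_{T_X | T_Y = y'}\{ A'(f;T) - A'(f) \le -\epsilon \} \le e^{-2\epsilon^2 p(1-p) N} \tag{10}$$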

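We are almost home. The inequalities presented above are conditioned on $T_Y = y'$. This means they only hold after we
fix $y = y'$ and sample from $T$ under the assumption that $y = y'$. We need to obtain inequalities after we relax that
restriction. Notice that, if we let $\delta = 2 e^{-2\epsilon^2 mn/(m+n)}$ and solve for $\epsilon$ in 8, we get the statement, equivalent to 8, that

$$P_{T_X | T_Y = y'}\left\{ |A'(f;T) - A'(f)| \ge \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(1-p)\, N}} \right\} \le \delta \tag{11}$$

To remove the conditional dependence of equation 11 on $y'$, observe the following argument. We take

$$\Phi(T, \delta) = \left\{ |A'(f;T) - A'(f)| \ge \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\}$$

to be a statement that can either be true or false, and we seek a bound on the probability that this statement
is true. For any $0 < \delta \le 1$, we have

$$P\{\Phi(T, \delta)\} = \mathbf{E}_T\{ \mathbb{I}_{\Phi(T, \delta)} \}$$
$$= \mathbf{E}_{T_Y}\{ \mathbf{E}_{T_X | T_Y = y}\{ \mathbb{I}_{\Phi(T, \delta)} \} \}$$
$$= \mathbf{E}_{T_Y}\{ P_{T_X | T_Y = y}\{ \Phi(T, \delta) \} \}$$
$$\le \mathbf{E}_{T_Y}\{ \delta \} \quad \text{(by our conditional result)}$$
$$= \delta$$

Thus, we see that

$$P\left\{ |A'(f;T) - A'(f)| \ge \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{12}$$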




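Similar arguments provide the following one sided bounds

$$P\left\{ A'(f;T) - A'(f) \ge \sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{13}$$

and

$$P\left\{ A'(f;T) - A'(f) \le -\sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta \tag{14}$$

If we want the bound written in terms of $\epsilon$, we may use an a priori estimate $p$ of $p(T_y)$, take $\epsilon$ to be $\sqrt{\frac{\ln(2/\delta)}{2 p (1-p) N}}$, and solve for $\delta$
in 12 to get

$$P\{ |A'(f;T) - A'(f)| \ge \epsilon \} \le 2 e^{-2\epsilon^2 p(1-p) N} \tag{15}$$

with analogous results for the one sided bounds. We apply the same analysis in appendix B to explicitly derive
generalization bounds for AUROC defined by $A(f;T)$ and $A(f)$ in place of $A'(f;T)$ and $A'(f)$, and obtain the exact
same bounds

$$P\left\{ |A(f;T) - A(f)| \ge \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2\, p(T_y)(1 - p(T_y))\, N}} \right\} \le \delta$$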


To complete our example, suppose we construct a mushroom identification model for identifying edible mushrooms
that has an observed ranking accuracy of .9 on a 10000 point test set in which 30 percent of the mushrooms are
edible, and suppose our significant other asks us: what is the probability that the actual AUROC of
your model is outside the .85 to .95 window? We would be able to answer that there is less than a

$$2 e^{-2 (.3)(.7)(10000)(.05)^2} = 2 e^{-10.5} \approx .000055$$

chance (about .0055 percent).
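
The arithmetic in this example, and the inverse questions a practitioner might ask (how wide is the interval for a given confidence, how large a test set is needed), can be checked with a short sketch of equation 15. The function names below are illustrative assumptions; `p` stands for the a priori estimate of the proportion of positives.

```python
import math

def delta_from_eps(eps, p, n):
    """Two sided bound of equation 15: 2 * exp(-2 eps^2 p (1 - p) N)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * p * (1.0 - p) * n)

def eps_from_delta(delta, p, n):
    """Half-width eps that equation 12 guarantees with probability 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * p * (1.0 - p) * n))

def n_required(eps, delta, p):
    """Smallest integer test-set size N for which the bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2 * p * (1.0 - p)))

# Mushroom example: eps = .05, 30 percent positives, 10000 test points.
mushroom_delta = delta_from_eps(0.05, 0.3, 10000)
```

Here `mushroom_delta` evaluates to roughly 5.5e-5, that is $2e^{-10.5}$. Because the bound depends on $p$ only through $p(1-p)$, swapping the roles of the positive and negative class leaves all three quantities unchanged.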

What are the takeaways? As $\epsilon$ decreases, our uncertainty increases; as $N$ increases, our certainty increases; as $p$ goes
to 0 our uncertainty increases to complete uncertainty, and as $p$ goes to 1, our uncertainty increases in the same way. Our uncertainty is
minimized as a function of $p$ when $p$ is .5. This inequality is a statement about the ranking accuracy in general and is
completely independent of the model we use, or the function $f$ we arrive at. Thus, if by some means we find a model
that maximizes empirical ranking accuracy, and we test it on an i.i.d. (independent and identically distributed) test set,
we can provide these guarantees without any further information about the underlying distributions of the problem, or
any assumptions on the nature of our solution.

[1] derives its generalization bounds assuming $A'(f;T)$ is described by equation 1, and $A'(f)$ is described by equation
3. We will instead use equation 2 for our empirical AUROC, and the corresponding expected ranking accuracy.
In practice, however, the generalization bound described by [1] is directly applicable to our simplified definition of
AUROC, as it is very unlikely for our classifiers to ever evaluate $f(x_i) = f(x_j)$, so long as $x_i \ne x_j$ for all $i \ne j$ (so
long as there are never two identical inputs that are labeled with separate outputs). A more rigorous consideration of
this subtle difference in definitions may be found in appendix A.

2.2 Uniform Convergence Results: 

In the previous section we showed that given a function $f$, a sample space $S$, two distributions $D_+$ and $D_-$ of positive
and negative instances respectively over $S$, and a test set $T$ of samples from $D_+$ and $D_-$, we may evaluate the AUROC
of the function $f$ on the test set and expect it to be asymptotically equal to the expected AUROC of $f$ on the underlying distributions as
the number of samples from $D_+$ and $D_-$ go to infinity. The results we described are independent of the nature of $f$,
$D_+$, and $D_-$ and are independent of the procedure we used to generate $f$, so long as we did not use $T$ in creating $f$.

For practitioners of machine learning this is enough to make a lot of progress. So long as practitioners are careful
about setting aside test sets, they can reasonably predict the expected performance of their classifiers by using the
performance their classifier provides on their test set. $f$ may be generated with support vector machines, linear models,
nonlinear models, neural networks, etc. and the same guarantees apply. A practitioner may test many different
approaches to generate $f$ without any concern for how mathematically reasonable their approach is, and they may still
provide reasonable assurances of the performance of their models.

But this does not give us any idea of how to best generate our function $f$. Perhaps the only reasonable approach for
generating a function $f$ with a desired AUROC or accuracy on an unknown distribution is to find a function $f$ that has
the desired accuracy or AUROC on a training set $\mathcal{D}$ that is a reasonable representative of the unknown distribution.
But how do we know this function will have a similar accuracy or AUROC on the unknown distribution as it has on
the representative set?

Now we will see that if we assume some properties of $f$, we can provide an estimate of the AUROC of $f$ on $D_+$ and
$D_-$, together with a confidence interval, by using only the AUROC of $f$ on the training set.

Suppose we are given a relatively constrained family of functions $F$, and a training set $\mathcal{D}$ of pairs $(x_i, y_i)$ where
$y_i \in \{-1, 1\}$, $x_i \in S$, and $\mathcal{D}$ is sampled from two distributions $D_+$ and $D_-$ with $(x_i, -1)$ sampled from $D_-$ and
$(x_i, 1)$ sampled from $D_+$. Then the results of this section may be summarized by the claim: any function $f \in F$
will have an AUROC on $\mathcal{D}$ that is within a small neighborhood of the AUROC of the function $f$ on the underlying
distributions over $S$ with high probability. That is to say, the AUROC of any $f$ on $\mathcal{D}$ will be similar to the true AUROC
with high probability. Therefore, we can use our training set $\mathcal{D}$ to find a function $f \in F$ that lands in a desired AUROC
neighborhood with high probability. This is what is known as a uniform convergence bound, and it requires that we formally
define the regularity of a collection of functions.

Before we dive into regularity of families of functions, let's use the results from the previous section to prove the
following theorem. This was done in [1].


Theorem 3 Let $F$ be a finite class of real-valued functions on $S$ and let $f_{\mathcal{D}} \in F$ denote the ranking function chosen
by a learning algorithm based on a training sequence $\mathcal{D}$ of length $M$. Let $p(\mathcal{D}_y)$ be the proportion of positive points in our
training set. Then for any $\delta > 0$,

$$P_{\mathcal{D}}\left\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \sqrt{\frac{\ln|F| + \ln\left(\frac{2}{\delta}\right)}{2\, p(\mathcal{D}_y)(1 - p(\mathcal{D}_y))\, M}} \right\} \le \delta$$


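Proof: First, we prove

$$P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \epsilon \} \le 2|F| e^{-2 p(y')(1 - p(y')) M \epsilon^2}$$

where $y' = (y_1, \dots, y_M) \in Y^M$ and $p(y')$ is the proportion of the $y_i$'s that are positive.

$$P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \epsilon \} \le P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\left\{ \max_{f \in F} |A'(f;\mathcal{D}) - A'(f)| \ge \epsilon \right\}$$
$$\le \sum_{f \in F} P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\{ |A'(f;\mathcal{D}) - A'(f)| \ge \epsilon \} \quad \text{(by the union bound)}$$
$$\le 2|F| e^{-2 p(y')(1 - p(y')) M \epsilon^2} \quad \text{(by theorem 2)}.$$

To finish the proof, we can set $\delta$ equal to $2|F| e^{-2 p(y')(1 - p(y')) M \epsilon^2}$ and solve for $\epsilon$ to get the following equivalent
statement:

$$P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\left\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \sqrt{\frac{\ln|F| + \ln\left(\frac{2}{\delta}\right)}{2\, p(y')(1 - p(y'))\, M}} \right\} \le \delta$$

and we play the same game as we did in theorem 2 to remove the dependence on $y'$, to get

$$P_{\mathcal{D}}\left\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \sqrt{\frac{\ln|F| + \ln\left(\frac{2}{\delta}\right)}{2\, p(\mathcal{D}_y)(1 - p(\mathcal{D}_y))\, M}} \right\} \le \delta \tag{16}$$

Notice that the above results are equally valid for $A(f_{\mathcal{D}};\mathcal{D})$ and $A(f_{\mathcal{D}})$ so long as theorem 2 holds for $A(f;\mathcal{D})$ and $A(f)$,
which was proven in appendix B. Also, the one sided bounds are proven analogously.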
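
The effect of the $\ln|F|$ term in the finite-class bound can be explored numerically. The sketch below simply evaluates the square-root expression of theorem 3; the function name and argument names are hypothetical, and `p` is an assumed proportion of positive training points.

```python
import math

def finite_class_eps(num_functions, delta, p, m_total):
    """Half-width from theorem 3 for a finite class F with |F| = num_functions."""
    numerator = math.log(num_functions) + math.log(2.0 / delta)
    return math.sqrt(numerator / (2.0 * p * (1.0 - p) * m_total))

# With |F| = 1 the expression reduces to the single-function bound of theorem 2.
single = finite_class_eps(1, 0.05, 0.3, 10000)
# The penalty for a larger class grows only like sqrt(ln|F|).
million = finite_class_eps(10 ** 6, 0.05, 0.3, 10000)
```

Going from one function to a million functions widens the interval, but only through the logarithm of the class size, which is why the finite-class result remains usable for fairly large classes.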


We present the result above because, as we will see after we spend a long time defining regularity of sets of functions,
our final result for infinite function sets will take the same form as the case for finite function
sets. In fact, for a quantity called the rank shattering dimension, given by $r(F, 2m, 2n)$, we will see

$$P\left\{ |A'(f_{\mathcal{D}};\mathcal{D}) - A'(f_{\mathcal{D}})| \ge \sqrt{\frac{\ln r(F, 2m, 2n) + \ln\left(\frac{2}{\delta}\right)}{2\, p(\mathcal{D}_y)(1 - p(\mathcal{D}_y))\, M}} \right\} \le \delta \tag{17}$$




We can get at the regularity of a model (which is often parameterized by a set of functions) by using something called
the shattering dimension of the model. Shattering dimension was first studied by Vapnik and its description may be
found in [2].

Definition: A model parameterized by $\theta$ is said to be able to shatter a set of points if, for any assignment of $y_i \in
\{-1, 1\}$ to $x_i$, there is some value of $\theta$ for which the model realizes the given assignment.

Definition: Given a model parameterized by $\theta$, the shattering dimension of the model is defined as the cardinality of
the largest set of points the model can shatter.

This is a very counter intuitive measure of complexity, so we will give some examples to try to get more comfortable 
with it. 

Example 1: A constant classifier has shattering dimension 0. This is because for any point, there is no way that a 
constant classifier can classify the point as both classes, so a constant classifier cannot shatter any nonempty set of 
points. 

Example 2: Suppose $S \subset \mathbb{R}$ and your classifier is defined by

$$c(x) = \begin{cases} -1 & \text{if } x < \theta \\ 1 & \text{otherwise} \end{cases}$$

The shattering dimension of this classifier is 1 because there is a set of points of size 1 that this classifier can shatter,
but no set of 2 points. For instance, take the set $\{x_1 = 0\}$. The possible labelings of this set are $y_1 = 1$ or $y_1 = -1$,
and both of these are attainable by, for instance, setting $\theta = -1$ and $\theta = 1$ respectively. However, no set of 2 or more points can
be shattered by this classifier. Suppose a set has the points $x_1$ and $x_2$. If $x_1 = x_2$, no value of $\theta$ will give us the labels
$c(x_1) = 1$ and $c(x_2) = -1$. If $x_1 \ne x_2$, then without loss of generality assume $x_1 < x_2$. No value of $\theta$ will provide
us a labeling where $c(x_1) = 1$ and $c(x_2) = -1$. Therefore, if a set has two or more elements, it cannot be shattered by
this classifier. Thus, the shattering dimension of this classifier is 1.

Example 3: Suppose $S \subset \mathbb{R}$ and your classifier is defined by

$$c(x) = \begin{cases} 1 & \text{if } \theta - 1 < x < \theta + 1 \\ -1 & \text{otherwise} \end{cases}$$

This classifier has dimension 2. This is because there is a set of two points, for instance $\{x_1 = 0, x_2 = 1\}$, that
can be shattered by this classifier, and no set of 3 or more points. The possible labelings of a set of 2 points are:
$\{c(x_1) = 1, c(x_2) = 1\}$, $\{c(x_1) = 1, c(x_2) = -1\}$, $\{c(x_1) = -1, c(x_2) = 1\}$, and $\{c(x_1) = -1, c(x_2) = -1\}$. By
picking $\theta$ as follows, each of these labelings may be obtained for $\{x_1 = 0, x_2 = 1\}$.

$\theta = .5 \Rightarrow \{c(x_1) = 1, c(x_2) = 1\}$

$\theta = -.5 \Rightarrow \{c(x_1) = 1, c(x_2) = -1\}$

$\theta = 1.5 \Rightarrow \{c(x_1) = -1, c(x_2) = 1\}$

$\theta = 3 \Rightarrow \{c(x_1) = -1, c(x_2) = -1\}$

Therefore, this classifier has shattering dimension equal to or greater than 2. To show that it has shattering dimension
less than 3, we need to prove no 3 points can be shattered by this classifier. Let $x_1 \le x_2 \le x_3$ be 3 arbitrary points
in $\mathbb{R}$, ordered without loss of generality for our convenience. If for some $i \ne j$, $x_i = x_j$, then $c(x_i) = c(x_j)$, and
any labeling with $c(x_i) \ne c(x_j)$ is not attainable, so we may further assume $x_1 < x_2 < x_3$. But in this instance, the
labeling $c(x_1) = 1$, $c(x_2) = -1$, and $c(x_3) = 1$ is clearly not attainable. Thus, there is no way to shatter 3 points with
this classifier.
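
Examples 2 and 3 can be checked by brute force. The sketch below enumerates every labeling that a one-parameter classifier realizes over a finite grid of $\theta$ values and compares it with all $2^k$ labelings of $k$ points. The grid, the helper names, and the choice of test points are assumptions made for illustration; a grid search can confirm that a set is shattered, but a negative answer is only suggestive, since the proofs in the text are what rule out every $\theta$.

```python
from itertools import product

def threshold(theta, x):
    """Classifier of example 2: -1 if x < theta, else 1."""
    return -1 if x < theta else 1

def interval(theta, x):
    """Classifier of example 3: 1 inside (theta - 1, theta + 1), else -1."""
    return 1 if theta - 1 < x < theta + 1 else -1

def can_shatter(classifier, points, thetas):
    """True if every +/-1 labeling of points is realized by some theta in thetas."""
    realized = {tuple(classifier(t, x) for x in points) for t in thetas}
    return all(lab in realized for lab in product((-1, 1), repeat=len(points)))

# Grid of theta values from -10 to 10 in steps of 0.25 (an arbitrary choice).
thetas = [k / 4 for k in range(-40, 41)]
```

With this grid, `can_shatter(threshold, [0], thetas)` and `can_shatter(interval, [0, 1], thetas)` are true, while `can_shatter(threshold, [0, 1], thetas)` and `can_shatter(interval, [0, 1, 2], thetas)` are false, matching the dimensions 1 and 2 derived above.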

Example 4: Let $S \subset \mathbb{R}^2$ and let

$$c(x) = \begin{cases} 1 & \text{if } \theta^T x > 0 \\ -1 & \text{if } \theta^T x \le 0 \end{cases}$$

where $\theta \in \mathbb{R}^2$. This classifier has shattering dimension of 2. Let $x_1 = (0, 1)$, and $x_2 = (1, 0)$. The possible labelings
for these points are:

$\{(-1, -1), (-1, 1), (1, -1), (1, 1)\}$

These labelings may be attained, for this set of 2 points, with the following vectors for $\theta$:

$\{[-1, -1], [2, -1], [-1, 2], [1, 1]\}.$



Thus, this model can shatter a set of 2 points. On the other hand, let $x_1$, $x_2$, and $x_3$ be any three arbitrary points.
Suppose there were 8 vectors $\theta_1, \theta_2, \theta_3, \theta_4, \dots, \theta_8$ that could be used to obtain all 8 possible labelings of $x_1$, $x_2$, and
$x_3$. Then,

$$\begin{bmatrix} \theta_1^T \\ \theta_2^T \\ \theta_3^T \\ \theta_4^T \\ \vdots \\ \theta_8^T \end{bmatrix} \cdot [x_1, x_2, x_3] = H = \begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} \\ h_{2,1} & h_{2,2} & h_{2,3} \\ h_{3,1} & h_{3,2} & h_{3,3} \\ h_{4,1} & h_{4,2} & h_{4,3} \\ \vdots & \vdots & \vdots \\ h_{8,1} & h_{8,2} & h_{8,3} \end{bmatrix} \tag{18}$$

where, if we define $\text{sign}(A)$ to be the matrix with the components

$$(\text{sign}(A))_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} > 0 \\ -1 & \text{if } A_{i,j} \le 0 \end{cases}$$

then

$$\text{sign}(H) = \begin{array}{c|rrr} 1 & -1 & -1 & -1 \\ 2 & -1 & -1 & 1 \\ 3 & -1 & 1 & -1 \\ 4 & -1 & 1 & 1 \\ 5 & 1 & -1 & -1 \\ 6 & 1 & -1 & 1 \\ 7 & 1 & 1 & -1 \\ 8 & 1 & 1 & 1 \end{array}$$

Here the row numbers of $\text{sign}(H)$ are presented for clarity. We may use row reduction and the definition of $\text{sign}(H)$
to prove $H$ has rank 3 by proving we can always row reduce $H$ so that the last 3 rows of $H$ are upper triangular with
positive diagonals. For all $H$, we may subtract a suitably scaled copy of row 6 from row 7 to make row 7 have 0 as
its first entry. Notice that, in such a case, row 7 would have a strictly positive second entry and a strictly negative third
entry. This is because a strictly positive number minus a non-positive number is strictly positive, and a non-positive
number minus a strictly positive number is strictly negative. We may similarly subtract a suitably scaled copy of the
5th row from the 8th row to give the 8th row a 0 first entry, a strictly positive second entry and a strictly positive third
entry. Finally, subtracting a suitably scaled copy of the modified 7th row (which has a 0 first entry, strictly positive
second entry and strictly negative third entry) from the 8th row will make the 8th row have zero first and second entries
and a strictly positive third entry. If we recall equation 18, we see that it is impossible, as the right hand side has rank
3 while the left hand side, a product of an $8 \times 2$ matrix and a $2 \times 3$ matrix, has rank at most 2.

Example 5: The linear classifiers for $\mathbb{R}^d$ defined analogously to that in example 4 have shattering dimension $d$.

Example 6: Let $S \subset \mathbb{R}$, and let $c \in \mathbb{R}$. The model

$$c(x) = \text{sign}(\sin(cx))$$

has infinite shattering dimension. In particular, it can shatter any finite set of points of the form $\{2^m\}$ where $m \in \mathbb{N}$.

I provide 1 example to highlight each of the properties I find important about shattering dimension. First, nonlinear 
classifiers often have higher shattering dimensions than linear classifiers, as can be seen with examples 2 and 3. 
Second, classifiers with higher dimensional parameter spaces often have higher shattering dimension, and lastly, 
sometimes shattering dimension is very hard to predict. It may sometimes break all the intuitive rules, as can be seen 
in example 6 . 


Vladimir N. Vapnik and Alexey Chervonenkis pioneered an approach to proving uniform convergence of accuracy for
an infinite collection of functions using the shattering dimension of the class of functions (also known as the
Vapnik-Chervonenkis dimension (VC dimension) [2]). This paper parallels this approach but with a slightly different measure
of the complexity of a class of functions.

Definition: (Bipartite rank matrix [1]) Let $f : S \to \mathbb{R}$ be a ranking function on $S$, let $m, n \in \mathbb{N}$, and let $X =
(x_1, \dots, x_m) \in S^m$ and $X' = (x'_1, \dots, x'_n) \in S^n$. Define the bipartite rank matrix of $f$ with respect to $X$ and $X'$,
denoted by $B_f(X, X')$, to be the matrix in $\{0, \frac{1}{2}, 1\}^{m \times n}$ whose $(i, j)$-th element is given by

$$[B_f(X, X')]_{ij} = \mathbb{I}_{\{f(x_i) > f(x'_j)\}} + \frac{1}{2}\, \mathbb{I}_{\{f(x_i) = f(x'_j)\}}$$

for all $i \in \{1, \dots, m\}$, and $j \in \{1, \dots, n\}$.

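Example: For the set of points $\{x_1, x_2, x_3, x_4, x_5\} = \{(1, 0), (0, 1), (-1, 0), (0, -1), (1, 1)\}$ and the function

$$f(x) = x^{(2)} \quad \text{(the second coordinate of } x\text{)}$$

If $X = (x_1, x_2, x_5)$ and $X' = (x_3, x_4)$, then

$$B_f(X, X') = \begin{bmatrix} 1/2 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}$$

If $X = (x_1, x_3, x_5)$ and $X' = (x_2, x_4)$, then

$$B_f(X, X') = \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 1/2 & 1 \end{bmatrix}$$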
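
The definition of the bipartite rank matrix translates directly into code. The sketch below assumes, as an illustration, that the ranking function of the example is the second coordinate of each point (an assumption consistent with the matrix whose last row is $(1/2, 1)$); the helper names are hypothetical. Exact fractions are used so that the entry $1/2$ is represented without rounding.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def bipartite_rank_matrix(f, X, X_prime):
    """Entry (i, j) is 1 if f(x_i) > f(x'_j), 1/2 if they are equal, else 0."""
    def entry(u, v):
        if u > v:
            return 1
        return HALF if u == v else 0
    return [[entry(f(x), f(xp)) for xp in X_prime] for x in X]

def second_coordinate(x):
    """Assumed ranking function for the worked example."""
    return x[1]

x1, x2, x3, x4, x5 = (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1)
B = bipartite_rank_matrix(second_coordinate, [x1, x3, x5], [x2, x4])
```

Under this assumption `B` is `[[0, 1], [0, 1], [Fraction(1, 2), 1]]`: the points $x_1$ and $x_3$ score below $x_2$ and above $x_4$, while $x_5$ ties with $x_2$.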


Notice that here we would need to diverge from [1] to consider the problem for AUC defined by $A(f)$. But the
difference is very minor: all values of $\frac{1}{2}$ would be replaced by 1, the analysis would actually be simplified, and the
bounds would be ever so slightly tighter.

Definition: (Bipartite rank-shatter coefficient) Let $F$ be a class of real-valued functions on $S$, and let $m, n \in \mathbb{N}$.
Define the $(m, n)$-th bipartite rank-shatter coefficient of $F$, denoted by $r(F, m, n)$, as follows:

$$r(F, m, n) = \max_{X \in S^m,\, X' \in S^n} |\{ B_f(X, X') \mid f \in F \}|$$

Put in words, the bipartite rank-shatter coefficient of a class of real-valued functions on $S$ is the maximum number of
distinct bipartite rank matrices obtainable on some collection of points in the positive and negative sets.

We can make the following direct observations. For a finite collection of functions,

$$r(F, m, n) \le |F|.$$

This is because each function provides us with at most 1 bipartite rank matrix for any two sets of points. For any set
of functions $F$,

$$r(F, m, n) \le 3^{mn}.$$

This is because $3^{mn}$ is the total number of $m \times n$ matrices with entries of 0, $\frac{1}{2}$, and 1. If we call ties wins
in our AUC estimate, this inequality becomes:

$$r(F, m, n) \le 2^{mn}$$

In fact, not all $m \times n$ matrices with entries 0, $\frac{1}{2}$ and 1 are attainable by a ranking function. For example, for a problem with

$\{x_1, x_2\} \in S^2$ and $\{x_3, x_4\} \in S^2$,

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

is not attainable, as it implies $f(x_1) > f(x_3)$, $f(x_1) < f(x_4)$, $f(x_2) < f(x_3)$ and $f(x_2) > f(x_4)$. But this means

$$f(x_4) < f(x_2) < f(x_3)$$

and

$$f(x_3) < f(x_1) < f(x_4)$$

so

$$f(x_4) < f(x_3)$$

and

$$f(x_3) < f(x_4)$$

which is a contradiction. [1] goes into complete detail about which $\{0, \frac{1}{2}, 1\}$ matrices are attainable by some ranking
function. We may follow their example and say

$$r(F, m, n) \le \psi(m, n) \le 3^{mn}$$

where $\psi(m, n)$ is the total number of attainable matrices.

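[1] proves that a bipartite rank matrix is attainable if and only if it does not contain one of the following 30 sub matrices. The first 15 are

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \frac12 & 1 \end{bmatrix} \begin{bmatrix} 1 & \frac12 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \frac12 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \frac12 & \frac12 \end{bmatrix}$$

$$\begin{bmatrix} 1 & \frac12 \\ 0 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \frac12 & 0 \\ \frac12 & 1 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & \frac12 \\ \frac12 & 1 \end{bmatrix}$$

$$\begin{bmatrix} 1 & \frac12 \\ \frac12 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ \frac12 & 1 \end{bmatrix} \begin{bmatrix} \frac12 & 0 \\ 0 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & 0 \\ \frac12 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ 0 & \frac12 \end{bmatrix}$$

and the remaining 15 are obtained from these by replacing every entry $a$ with $1 - a$:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ \frac12 & 0 \end{bmatrix} \begin{bmatrix} 0 & \frac12 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & \frac12 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ \frac12 & \frac12 \end{bmatrix}$$

$$\begin{bmatrix} 0 & \frac12 \\ 1 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \frac12 & 1 \\ \frac12 & 0 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & \frac12 \\ \frac12 & 0 \end{bmatrix}$$

$$\begin{bmatrix} 0 & \frac12 \\ \frac12 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ \frac12 & 0 \end{bmatrix} \begin{bmatrix} \frac12 & 1 \\ 1 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & 1 \\ \frac12 & \frac12 \end{bmatrix} \begin{bmatrix} \frac12 & \frac12 \\ 1 & \frac12 \end{bmatrix}$$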


Proof (of one direction): We provide a simple proof of one direction of the equivalence. This direction gives us an
upper bound on $\psi(m, n)$. Any attainable bipartite rank matrix must not provide any contradictions of the following
forms:

$$f(X_i) < f(X_j),\ f(X_j) < f(X_i)$$

$$f(X_i) = f(X_j),\ f(X_i) < f(X_j)$$

$$f(X_i) = f(X_j),\ f(X_i) > f(X_j)$$

$$f(X_i) < f(X_i)$$

Notice that an $i, j$ entry of 0 in the bipartite rank matrix implies $f(x_i) < f(x'_j)$, an $i, j$ entry of $\frac{1}{2}$ implies $f(x_i) =
f(x'_j)$, and an $i, j$ entry of 1 implies $f(x_i) > f(x'_j)$, so we do not have to consider contradictions including $\le$ or $\ge$. We
may look at each sub matrix as a restriction of $f$ to two subsets of two points each. For example, suppose our two sets
are $\{x_1, x_2, \dots, x_m\} \in S^m$ and $\{x'_1, x'_2, \dots, x'_n\} \in S^n$, and our bipartite rank matrix for a function $f$ has

$$Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

as a sub matrix. Suppose $Q_{1,1}$ corresponds to the $i$th row and $j$th column of the original matrix, $Q_{1,2}$ corresponds
to the $i$th row and $k$th column of the original matrix, $Q_{2,1}$ corresponds to the $l$th row and $j$th column of the
original matrix, and $Q_{2,2}$ corresponds to the $l$th row and $k$th column of the original matrix. Then the bipartite
rank matrix of $f$ on the sets $\{x_i, x_l\} \in S^2$, $\{x'_j, x'_k\} \in S^2$ is precisely

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

That is, $f(x_i) > f(x'_j)$, $f(x_i) < f(x'_k)$, $f(x_l) < f(x'_j)$, and $f(x_l) > f(x'_k)$, which is impossible as it leads to the first
type of contradiction. Thus, for one direction, it suffices to show that each of the sub matrices provided results in one
of the 4 contradictions listed. One simple observation to make is that if a $2 \times 2$ matrix $A$
provides a contradiction, then the $2 \times 2$ matrix

$$B = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} - A$$

will also provide a contradiction, as $B$ contains the same equalities as $A$ but the opposite direction of inequalities.
This means, in our proof, we only need to find the contradictions for the first 15 sub matrices, which are exactly the matrices
whose diagonal entries lie in $\{\frac{1}{2}, 1\}$ and whose off-diagonal entries lie in $\{0, \frac{1}{2}\}$, excluding the matrix with all entries $\frac{1}{2}$. For $\{x_i, x_l\} \in S^2$ and
$\{x'_j, x'_k\} \in S^2$, a diagonal entry in $\{\frac{1}{2}, 1\}$ gives $f(x_i) \ge f(x'_j)$ and $f(x_l) \ge f(x'_k)$, while an off-diagonal entry in $\{0, \frac{1}{2}\}$ gives
$f(x_i) \le f(x'_k)$ and $f(x_l) \le f(x'_j)$. Chaining these,

$$f(x'_j) \le f(x_i) \le f(x'_k) \le f(x_l) \le f(x'_j)$$

and since at least one entry differs from $\frac{1}{2}$, at least one of these inequalities is strict. Each of the 15 sub matrices therefore yields

$$f(x'_j) < f(x'_j)$$

which is a contradiction of the fourth type.


To prove the other direction, we would need to show that any bipartite rank matrix that contains one of the 4 contradictions
must have one of the 30 above sub matrices. One way to do this is to demonstrate that the remaining 51 $2 \times 2$ matrices
with entries in $\{0, \frac{1}{2}, 1\}$ are all self consistent, and to show that if an $m \times n$ matrix with entries in $\{0, \frac{1}{2}, 1\}$ is inconsistent, then it must have
an inconsistent $2 \times 2$ sub matrix. The other direction is not nearly as interesting, as it just says we are dealing with a
sharp bound, but after all is said and done we will see our bound is very loose for other reasons anyway, so there is
little point in proving this part of the bound is sharp.
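
The count of 30 forbidden and 51 attainable $2 \times 2$ matrices can be confirmed by exhaustive search. The sketch below assigns each of the two positive and two negative points a score in $\{0, 1, 2, 3\}$, which is enough to realize every possible ordering (with ties) of four values, and collects the resulting bipartite rank matrices. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def entry(u, v):
    """Bipartite rank entry for positive score u and negative score v."""
    if u > v:
        return 1
    return HALF if u == v else 0

def attainable_2x2():
    """All 2 x 2 bipartite rank matrices realized by some scores of 4 points."""
    seen = set()
    for u1, u2, v1, v2 in product(range(4), repeat=4):
        seen.add(((entry(u1, v1), entry(u1, v2)),
                  (entry(u2, v1), entry(u2, v2))))
    return seen

attainable = attainable_2x2()
all_matrices = {((a, b), (c, d))
                for a, b, c, d in product((0, HALF, 1), repeat=4)}
forbidden = all_matrices - attainable
```

Of the 81 candidate matrices, the search finds 51 attainable ones and leaves 30 unattainable, including the identity-like matrix used as the running example. This only checks the $2 \times 2$ case; it does not by itself prove the statement about larger matrices.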

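Theorem 4 (Uniform Convergence Bound) Let $F$ be a class of real-valued functions on $S$, and let $y' = (y_1, \dots, y_M) \in
Y^M$ be any label sequence of length $M \in \mathbb{N}$. Let $m$ be the number of positive labels in $y'$, and $n = M - m$ the number
of negative labels in $y'$. Then for any $\epsilon > 0$

$$P_{\mathcal{D}_S | \mathcal{D}_Y = y'}\left\{ \sup_{f \in F} |A'(f;\mathcal{D}) - A'(f)| \ge \epsilon \right\} \le 4\, r(F, 2m, 2n)\, e^{-mn\epsilon^2 / (8(m+n))}$$

Moreover, we may remove the dependence on $y'$ as we did in previous theorems to get the bound

$$P_{\mathcal{D}}\left\{ \sup_{f \in F} |A'(f;\mathcal{D}) - A'(f)| \ge \sqrt{\frac{8 \left( \ln r(F, 2m, 2n) + \ln\left(\frac{4}{\delta}\right) \right)}{p(\mathcal{D}_y)(1 - p(\mathcal{D}_y))\, M}} \right\} \le \delta \tag{19}$$

where $m = p(\mathcal{D}_y) M$ and $n = (1 - p(\mathcal{D}_y)) M$ [1].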




To extend these results to our modified definition of AUC, we would need to define the bipartite rank shattering
dimension for our simplified loss function. Our bipartite rank matrices would be $m \times n$ matrices with $\{0, 1\}$ entries, and we
may easily show that no bipartite rank matrix with either of the following sub matrices is attainable:

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

All other $2 \times 2$ sub matrices provide attainable bipartite rank matrices. This, in and of itself, gives us an upper bound
on the rank shattering dimension of our new problem. Next, we could modify theorem 4 to suit our definition of AUC
using our new definition of bipartite rank shattering dimension.
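
The claim that exactly two $2 \times 2$ patterns are unattainable when ties count as wins can also be checked exhaustively. In this sketch an entry is 1 when the positive score is greater than or equal to the negative score and 0 otherwise; scores in $\{0, 1, 2, 3\}$ cover every ordering of four values. The names are hypothetical.

```python
from itertools import product

def attainable_2x2_ties_as_wins():
    """All 2 x 2 rank matrices with entries I{f(x_i) >= f(x'_j)}."""
    seen = set()
    for u1, u2, v1, v2 in product(range(4), repeat=4):
        seen.add(((int(u1 >= v1), int(u1 >= v2)),
                  (int(u2 >= v1), int(u2 >= v2))))
    return seen

attainable_01 = attainable_2x2_ties_as_wins()
all_01 = {((a, b), (c, d)) for a, b, c, d in product((0, 1), repeat=4)}
missing_01 = all_01 - attainable_01
```

The search leaves exactly the two matrices displayed above unattainable, so 14 of the 16 binary $2 \times 2$ matrices are attainable.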


[1] provides a way to relate bipartite rank shattering dimension to something like VC dimension but for a 3 class 
problem. In our definition of AUC, our bipartite rank matrices only have 2 choices for each entry, so we would actually 
be able to relate our bipartite rank shattering dimension directly to VC dimension of a 2 class problem. Therefore, we 
could get upper estimates of the bipartite rank shattering dimension of our classes of functions using standard results 
of the VC dimension of our classes of functions. 


3 Consistency: 


The generalization result tells us if we test an AUC classifier on an independent test set and obtain an AUC, we may 
provide a conservative confidence interval for the true AUC on the underlying distribution. The uniform convergence 
results tell us, if we choose a class of functions whose bipartite rank shatter coefficient does not increase too quickly 
with the number of positive and negative training points, we may provide a conservative confidence interval on the 
true AUC based only on the AUC obtained on the training set. In practice, the bipartite rank shattering coefficient 
may be very difficult to bound, or the bound may not be very tight. It is often best to use the test set to provide a 
confidence interval on the performance of the classifier on an unknown set. However, the uniform convergence result 
is still important as it provides insight on how to regularize our families of functions. To understand the problem of 
over-fitting, and under-fitting, a machine learning practitioner should have a tool belt of uniform convergence results. 

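The last part of the puzzle is the numerical problem of finding a function, in a class of functions, that maximizes the AUC on the training set. That is, we need to be able to solve the following problem:

$$\arg\min_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \left[ \mathbb{1}_{f(x_i) < f(x_j)} + \frac{1}{2} \mathbb{1}_{f(x_i) = f(x_j)} \right] = \arg\max_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \left[ \mathbb{1}_{f(x_i) > f(x_j)} + \frac{1}{2} \mathbb{1}_{f(x_i) = f(x_j)} \right]$$

Or, as in our case, we seek a solution of the form

$$\arg\min_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{f(x_i) < f(x_j)} = \arg\max_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{f(x_i) > f(x_j)} \tag{20}$$

where the two forms in 20 agree whenever $f$ produces no ties between a positive and a negative instance.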


In general, both of these problems are non-convex, NP-hard optimization problems, because the indicator function is non-convex. The indicator function is also not differentiable, and it is not suitable for gradient descent methods because its derivative is 0 wherever it is defined. The standard approach to solve this problem is to use a convex, often smooth, surrogate function in place of the indicator function. A surrogate function is said to be consistent if, in the limit as the training set gets large, the minimizer of the surrogate loss function approaches the minimizer of the original loss function.


There has been a lot of work done on finding consistent surrogate loss functions for the problem of ranking. Results on this can be found in [3], [4].
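To make the contrast between the indicator loss and a convex surrogate concrete, here is a small sketch in plain Python. It assumes the pairwise convention that a pair is an error when a positive instance scores strictly below a negative one, and it uses the hinge function as the surrogate; the function names and the toy scores are illustrative only.

```python
def zero_one_pair_loss(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that are ranked in the wrong order."""
    pairs = [(p, q) for p in pos_scores for q in neg_scores]
    return sum(1 for p, q in pairs if p < q) / len(pairs)

def hinge_pair_loss(pos_scores, neg_scores):
    """Mean of max(1 - (p - q), 0) over all (positive, negative) pairs."""
    pairs = [(p, q) for p in pos_scores for q in neg_scores]
    return sum(max(1.0 - (p - q), 0.0) for p, q in pairs) / len(pairs)

pos = [2.0, 0.5, 1.5]
neg = [1.0, -1.0]
nudged = [p + 0.01 for p in pos]  # a small change to the positive scores

# The indicator loss does not react to the nudge; the hinge loss does.
print(zero_one_pair_loss(pos, neg), zero_one_pair_loss(nudged, neg))
print(hinge_pair_loss(pos, neg), hinge_pair_loss(nudged, neg))
```

The hinge value is never smaller than the indicator value, and it changes under the small perturbation, which is what gives a gradient-based solver something to follow.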


4 Surrogate Loss Functions: 

4.1 Method Of Surrogate Loss Functions: 

Many efficient methods for optimizing AUROC have been proposed. For practical purposes, these methods are suggested and may be found in various papers such as [5], [6], [7], [8], [9]. We are instead interested in comparing the performance of a model trained to optimize for AUC with the performance of a model trained to optimize for accuracy. Does optimizing for AUC in fact improve the AUC performance of a classifier? Surprisingly, this seems to be a controversial topic. A review of some of the controversy may be found in [10], where it is proven that many standard surrogate loss functions designed for accuracy also minimize the AUROC regret in the limit as the number of data points goes to infinity. These results seem to suggest there isn't a great deal to be gained by optimizing for AUC: many standard accuracy optimization procedures may be just as effective at optimizing for AUC.

Our goal was to develop a fair test of the effectiveness of optimizing AUC. To do this we developed a transformation of variables that could be performed on the input training set that would allow a standard off-the-shelf linear solver trained for accuracy to be used to find an optimal AUC solution. In this way we could apply the same solver to optimize for both quantities and reduce the possible external sources of error and inconsistency in comparing the AUC scores of both approaches. We wanted to compare apples to apples, instead of apples to oranges.

Suppose $F$ as in equation 20 is the set of all linear functions and $S \subset \mathbb{R}^d$. Then 20 becomes

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{w^T x_i < w^T x_j} \tag{21}$$

The standard approach for this type of problem is to replace the non-convex indicator function with a convex surrogate loss function and restrict the magnitude of $w$ to be within a certain radius. This can be done by solving the surrogate problem

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \phi\big(w^T (x_i - x_j)\big) + \lambda^2 w^T w \tag{22}$$

By comparison, the classical accuracy surrogate problem is

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m+n} \sum_{i=1}^{m+n} \phi\big(y_i w^T x_i\big) + \lambda^2 w^T w \tag{23}$$

It can be shown that solving the problem 23 is equivalent to solving the problem

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m+n} \sum_{i=1}^{m+n} \phi\big(y_i w^T x_i\big) \quad \text{s.t. } w^T w \leq C \tag{24}$$

for the appropriate choice of $C > 0$. This is difficult to show in general but accessible for the case where $\phi$ is the square loss function. The proof of this statement requires the method of Lagrange multipliers and the Karush-Kuhn-Tucker conditions, also known as the KKT conditions.


Traditionally $\phi$, the surrogate loss function, has been chosen to be a convex function so that equation 23 is a convex optimization problem. With an appropriate choice of $\phi$ one may bound the risk of the computed minimizer of 23 by the $\phi$ risk of the computed minimizer of 23. The full details may be found in [11].

The main points of the analysis provided in [11] are that we must choose $\phi$ to be differentiable at 0 with $\phi'(0) < 0$. A convex $\phi$ is usually chosen so the resulting problem is a convex optimization problem. A function $\phi$ that satisfies these relatively loose conditions is called accuracy calibrated, and the risk of the minimizer of 23 satisfies the following relationship

$$\psi\big(R(f) - R^*\big) \leq R_\phi(f) - R_\phi^*$$

where $R(f)$ is the risk of a function $f$, which is given as

$$R(f) = E\big[\mathbb{1}_{y f(x) < 0}\big]$$

(for accuracy), $R^*$ is the Bayes risk, $R_\phi(f)$ is the risk of $f$ with respect to the $\phi$ loss function, which is given by

$$R_\phi(f) = E\big[\phi(y f(x))\big]$$

and $R_\phi^*$ is the Bayes risk with respect to the $\phi$ loss function, and $\psi$ is a non-decreasing function $\psi : [0, 1] \to [0, \infty)$ that is given by

$$\psi(\theta) = \phi(0) - H\left(\frac{1 + \theta}{2}\right)$$

and

$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \big\{ \eta\, \phi(\alpha) + (1 - \eta)\, \phi(-\alpha) \big\}$$


Example: The square loss function is given by

$$\phi(x) = (1 - x)^2$$

This function is differentiable at $x = 0$ and the derivative is $-2 < 0$ at $x = 0$. Also, this function is convex since $\phi'' = 2 > 0$. Thus this function is accuracy calibrated. [11] tells us

$$\psi\big(R(f) - R^*\big) \leq R_\phi(f) - R_\phi^*$$

where

$$\psi(\theta) = \phi(0) - H\left(\frac{1+\theta}{2}\right) = 1 - \inf_{\alpha \in \mathbb{R}} \left\{ \frac{1+\theta}{2}\phi(\alpha) + \frac{1-\theta}{2}\phi(-\alpha) \right\} = 1 - \inf_{\alpha \in \mathbb{R}} \left\{ \frac{1+\theta}{2}(1-\alpha)^2 + \frac{1-\theta}{2}(1+\alpha)^2 \right\}$$

Now the infimum may be calculated by setting the derivative with respect to $\alpha$ to zero and solving for $\alpha$. The infimum is reached when $\alpha = \theta$, where the expression in braces equals $1 - \theta^2$. Thus, for the square loss function,

$$\psi(\theta) = \theta^2$$

Therefore, for the square loss function,

$$\big(R(f) - R^*\big)^2 \leq R_\phi(f) - R_\phi^*$$

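Example: The hinge loss function is given by

$$\phi(x) = \max\{1 - x, 0\}$$

This function is also convex, differentiable at 0, and $\phi'(0) = -1 < 0$. Therefore, this function is accuracy calibrated. Here

$$H\left(\frac{1+\theta}{2}\right) = \inf_{\alpha \in \mathbb{R}} \left\{ \frac{1+\theta}{2}\max\{1-\alpha, 0\} + \frac{1-\theta}{2}\max\{1+\alpha, 0\} \right\}$$

so

$$\psi(\theta) = 1 - \inf_{\alpha \in \mathbb{R}} \left\{ \frac{1+\theta}{2}\max\{1-\alpha, 0\} + \frac{1-\theta}{2}\max\{1+\alpha, 0\} \right\}$$

To evaluate the infimum, we need to break the problem up into 3 cases:

$$\begin{aligned}
\psi(\theta) &= 1 - \inf_{\alpha \in \mathbb{R}} \begin{cases} \frac{1+\theta}{2}(1-\alpha) & \text{for } \alpha < -1 \\ \frac{1+\theta}{2}(1-\alpha) + \frac{1-\theta}{2}(1+\alpha) & \text{for } -1 \leq \alpha \leq 1 \\ \frac{1-\theta}{2}(1+\alpha) & \text{for } 1 < \alpha \end{cases} \\
&= 1 - \min\{1 + \theta, 1 - \theta\} \\
&= 1 - 1 - \min\{\theta, -\theta\} \\
&= -\min\{\theta, -\theta\} \\
&= |\theta|
\end{aligned}$$

Thus, for the hinge loss, we get the following guarantee,

$$|R(f) - R^*| \leq R_\phi(f) - R_\phi^* \tag{25}$$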
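The two closed forms, $\psi(\theta) = \theta^2$ for the square loss and $\psi(\theta) = |\theta|$ for the hinge loss, can be checked numerically. The sketch below assumes the definitions $\psi(\theta) = \phi(0) - H((1+\theta)/2)$ and $H(\eta) = \inf_\alpha \{\eta\phi(\alpha) + (1-\eta)\phi(-\alpha)\}$, and replaces the infimum by a minimum over a grid on $[-3, 3]$. The grid bounds and function names are choices made for this illustration.

```python
def psi_transform(phi, theta, lo=-3.0, hi=3.0, steps=6001):
    """Approximate psi(theta) = phi(0) - H((1 + theta) / 2) by a grid search over alpha."""
    eta = (1.0 + theta) / 2.0
    width = (hi - lo) / (steps - 1)
    best = min(
        eta * phi(lo + k * width) + (1.0 - eta) * phi(-(lo + k * width))
        for k in range(steps)
    )
    return phi(0.0) - best

def square_loss(x):
    return (1.0 - x) ** 2

def hinge_loss(x):
    return max(1.0 - x, 0.0)

for theta in (0.0, 0.25, 0.5, 0.9):
    # Columns: theta, numeric psi for square loss, theta^2, numeric psi for hinge, |theta|
    print(theta, psi_transform(square_loss, theta), theta ** 2,
          psi_transform(hinge_loss, theta), abs(theta))
```

The grid search is only a sanity check: it works here because both infima are attained inside $[-1, 1]$.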




What does

$$|R(f) - R^*| \leq R_\phi(f) - R_\phi^*$$

tell us? Observe that $R^* = \inf_f R(f)$ and $R_\phi^* = \inf_f R_\phi(f)$. The bound 25 tells us that if we are given a function $f$ whose hinge loss risk is within $\epsilon$ of the hinge loss risk of the best function, the risk of that function will be within $\epsilon$ of the Bayes risk. Thus, minimizing the hinge loss risk of the model will also minimize the risk of the model.

There are a couple of observations to be made. The first is that while $1 \geq R(f) - R^* \geq 0$, we only know that $\infty > R_\phi(f) - R_\phi^* \geq 0$. Therefore, the bound 25 may not be useful if $R_\phi(f) - R_\phi^* > 1$. The same may be said about all of the surrogate loss functions.

The second observation is that the application of 25 is clear when the Bayes risk is 0, and when $R_\phi^*$ is zero, but it is not clear when $R_\phi^* > 0$. Furthermore, in practice we cannot know $R_\phi(f)$ precisely, but we may get an approximation of $R_\phi(f)$ using techniques in the generalization and uniform convergence section of this paper. One suggestion is to use a validation set to get an empirical estimate of the $\phi$ risk of $f$, and re-purpose McDiarmid's inequality for the new loss function to get a confidence interval for the true $\phi$ risk of $f$.

4.2 Using Accuracy for AUC: 

With a simple transformation of variables, one may transform the AUC optimization problem

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{w^T (x_i - x_j) < 0} \tag{26}$$

into an accuracy optimization problem of the form

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m+n} \sum_{i=1}^{m+n} \mathbb{1}_{y_i w^T x_i < 0} \tag{27}$$

To see how this can be done, we can begin by making the following startling observation. For a binary linear classifier, where $y_i \in \{-1, 1\}$, one may arbitrarily relabel any data point $x_i$ by multiplying both $y_i$ and $x_i$ by $-1$ without changing the optimization problem, because the product $y_i w^T x_i$ is unchanged. In particular, one may transform the 2 class accuracy problem into a 1 class problem with the following transformation of the data points: $z_i = y_i x_i$. With $z_i$ defined as such, equation 27 becomes

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m+n} \sum_{i=1}^{m+n} \mathbb{1}_{w^T z_i < 0} \tag{28}$$

Problem 28 can be thought of as finding the weight vector $w$ for which $w^T z_i$ is most often positive. A small minimum, and by extension a small risk, may be found if there is a hyperplane that passes through the origin for which the $z_i$ are most often on one side of the hyperplane.

Setting $p_k = x_i - x_j$ for each positive instance $x_i$ and negative instance $x_j$, and re-indexing accordingly, turns problem 26 into

$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{P} \sum_{k=1}^{P} \mathbb{1}_{w^T p_k < 0} \tag{29}$$

where $P = mn$. We can solve 29 with our favorite solver, whether that is a support vector machine, ridge regression, logistic regression, etc. Our choice of solver for problem 29 amounts to picking our favorite surrogate loss function and solving 23. Sometimes a solver such as a support vector machine will complain about the absence of 2 classes. If this is the case, one may artificially create 2 classes by employing the trick described by 28 in reverse, negating some of the $p_k$ and labeling them $-1$. This technique was observed to have no effect on the weight vector discovered by the solver.
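The change of variables can be sketched with NumPy as follows. The code assumes the strict-inequality convention (a pair counts as correctly ranked only when $w^T(x_i - x_j) > 0$), and every function name here is illustrative rather than taken from the paper. It builds the $mn$ difference vectors, checks that the empirical AUC of $w$ equals the one-class accuracy of $w$ on the differences, and shows the two-class relabeling trick.

```python
import numpy as np

def pairwise_differences(X_pos, X_neg):
    """Stack every difference x_i - x_j (positive minus negative) into an (m*n, d) array."""
    return (X_pos[:, None, :] - X_neg[None, :, :]).reshape(-1, X_pos.shape[1])

def empirical_auc(w, X_pos, X_neg):
    """Fraction of (positive, negative) pairs with a strictly larger positive score."""
    return float(np.mean((X_pos @ w)[:, None] > (X_neg @ w)[None, :]))

def one_class_accuracy(w, P):
    """Fraction of rows p of P that lie strictly on the positive side of w."""
    return float(np.mean(P @ w > 0))

def make_two_class(P, rng):
    """Negate a random subset of rows and label them -1, so a solver sees two classes."""
    y = rng.choice([-1.0, 1.0], size=len(P))
    return P * y[:, None], y

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(6, 3))
X_neg = rng.normal(0.0, 1.0, size=(4, 3))
w = np.array([1.0, 0.5, -0.25])

P = pairwise_differences(X_pos, X_neg)
Q, y = make_two_class(P, rng)
print(P.shape, empirical_auc(w, X_pos, X_neg), one_class_accuracy(w, P))
print(float(np.mean(y * (Q @ w) > 0)))  # two-class accuracy on the relabeled data
```

Note that the transformed set has $mn$ rows, which is the source of the cost discussed for this approach.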




4.3 Methods for testing AUC: 

Methods without cross validation: The previous subsection suggests the following approach for testing whether it is reasonable to optimize for AUC. Given a training set

$$X = \{x_1, \ldots, x_m, x'_1, \ldots, x'_n\}$$

where the $x_i$'s are positive instances and the $x'_j$'s are negative instances, two regularization parameters $\lambda_{auc} > 0$ and $\lambda_{acc} > 0$, and a test set

$$T = \{t_1, t_2, \ldots, t'_1, t'_2, \ldots\}$$

we can construct the transformed dataset

$$z_1 = x_1 - x'_1, \quad z_2 = x_1 - x'_2, \quad \ldots, \quad z_n = x_1 - x'_n, \quad z_{n+1} = x_2 - x'_1, \quad \ldots, \quad z_{mn} = x_m - x'_n$$

choose a surrogate loss function $\phi$, and solve the two problems

$$w_{acc} = \arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{m+n} \sum_{i=1}^{m+n} \phi\big(y_i w^T x_i\big) + \lambda_{acc} w^T w \right\}$$

where the sum runs over all $m + n$ training instances with their labels $y_i$, and

$$w_{auc} = \arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{mn} \sum_{i=1}^{mn} \phi\big(w^T z_i\big) + \lambda_{auc} w^T w \right\}$$

This gives us $w_{acc}$ and $w_{auc}$, two estimates of a risk minimizing parameterization for the function $f = w^T x$. We may validate both of these estimates on $T$ by evaluating the AUROC of $f$ on $T$. We may even provide confidence intervals by using the generalization results for AUC. Notice, we do not need to transform $T$ to validate $w_{auc}$.

One obvious observation is that finding $w_{auc}$ is computationally expensive, since it requires a dataset with $O(mnd)$ entries. This is by no means an efficient implementation of an AUC optimizer. For efficient implementations of AUC optimizers consult the references provided in the beginning of this section.

The second important observation is that we are not, in practice, given $\lambda_{auc} > 0$ and $\lambda_{acc} > 0$, so we must use cross validation to find good values for these regularization terms. The validation set chosen for the AUC problem must be set aside before the transformation of the training set.




Method with cross validation: Given a training set

$$X = \{x_1, x_2, \ldots, x_m, x'_1, x'_2, \ldots, x'_n\}$$

and a test set

$$T = \{t_1, t_2, \ldots, t'_1, t'_2, \ldots\}$$

for each choice of $\lambda_{acc}$ and for each choice of $\lambda_{auc}$ we randomly partition the training set into a training set $V'$, containing $m'$ positive and $n'$ negative instances, and a validation set $V^*$, containing the remaining $m^*$ positive and $n^*$ negative instances, where $m' + m^* = m$ and $n' + n^* = n$. We run the procedure described in the subsection Methods without cross validation using the set $V^*$ as the validation set and the set $V'$ as the training set. This gives us estimates of the expected performances of the two models for each value of $\lambda_{acc}$ and $\lambda_{auc}$. We then select the values of $\lambda_{acc}$ and $\lambda_{auc}$ that performed optimally on the validation set. Finally, we apply the methods in the subsection Methods without cross validation on the full training set with the values of $\lambda_{acc}$ and $\lambda_{auc}$ previously selected and evaluate our final performance on the test set $T$.
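The procedure can be sketched end to end with NumPy. This is an illustration under stated assumptions, not the experimental code: it uses the square loss so that both problems have a closed-form solution (the experiments in this paper use the hinge loss), it draws a single random split that is shared by all candidate values of $\lambda$, and every function name is hypothetical.

```python
import numpy as np

def pairwise_differences(X_pos, X_neg):
    """All differences x_i - x'_j as an (m*n, d) array."""
    return (X_pos[:, None, :] - X_neg[None, :, :]).reshape(-1, X_pos.shape[1])

def fit_square_loss(Z, lam):
    """Closed-form minimizer of mean_i (1 - w^T z_i)^2 + lam * w^T w."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z / len(Z) + lam * np.eye(d), Z.mean(axis=0))

def fit_acc(X_pos, X_neg, lam):
    # z_i = y_i x_i turns the two-class accuracy problem into a one-class problem.
    return fit_square_loss(np.vstack([X_pos, -X_neg]), lam)

def fit_auc(X_pos, X_neg, lam):
    return fit_square_loss(pairwise_differences(X_pos, X_neg), lam)

def auc(w, X_pos, X_neg):
    return float(np.mean((X_pos @ w)[:, None] > (X_neg @ w)[None, :]))

def select_lambda(fit, X_pos, X_neg, lambdas, rng, train_frac=0.5):
    """Return the lambda whose model scores the best AUC on a held-out split."""
    pos, neg = rng.permutation(len(X_pos)), rng.permutation(len(X_neg))
    mp, nn = int(train_frac * len(pos)), int(train_frac * len(neg))
    scores = []
    for lam in lambdas:
        # The validation rows are set aside before any pairwise transformation.
        w = fit(X_pos[pos[:mp]], X_neg[neg[:nn]], lam)
        scores.append(auc(w, X_pos[pos[mp:]], X_neg[neg[nn:]]))
    return lambdas[int(np.argmax(scores))]

def compare(X_pos, X_neg, T_pos, T_neg, lambdas, seed=0):
    """Select lambda for each model, refit on all training data, score AUC on T."""
    rng = np.random.default_rng(seed)
    result = {}
    for name, fit in (("acc", fit_acc), ("auc", fit_auc)):
        lam = select_lambda(fit, X_pos, X_neg, lambdas, rng)
        result[name] = (lam, auc(fit(X_pos, X_neg, lam), T_pos, T_neg))
    return result

rng = np.random.default_rng(1)
X_pos, X_neg = rng.normal(2.0, 1.0, (40, 3)), rng.normal(0.0, 1.0, (60, 3))
T_pos, T_neg = rng.normal(2.0, 1.0, (100, 3)), rng.normal(0.0, 1.0, (100, 3))
print(compare(X_pos, X_neg, T_pos, T_neg, [0.01, 0.1, 1.0]))
```

Using the same solver for both models is the point of the design: any difference in test AUC then comes from the training objective rather than from the optimizer.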




4.4 Experimental Results of the linear classifier: 

For our experiments, we run 10 1-versus-all binary classification problems on the MNIST dataset. In each of them we run cross validation to select $\lambda_{acc}$ and we run cross validation to select $\lambda_{auc}$. Finally, we compare the AUC scores of both the accuracy models and the AUC models. In all cases, we use the support vector machine hinge loss as our surrogate loss function. For each 1-versus-all problem we run 20 trials, randomly shuffling the entire MNIST dataset each time and making note of the seed used before making the shuffle, for complete reproducibility. In each trial we run a cross validation loop, with the training and validation sets taken to be the first and second hundred points of the shuffled training set, to select the best values for the solver parameters $C_{acc}$ and $C_{auc}$, which are determined by $\lambda_{acc}$ and $\lambda_{auc}$ respectively. We then create a model fitting the 100 points in the training set using the $C_{acc}$ and $C_{auc}$ obtained through our validation loop. Finally, we test this model on the 40000 data points that follow after the validation data points; this is our test set.

We use a very small training set and a very large test set because the digit classification problem is a very easy one. When we take too many training points, we obtain AUC scores for both models close enough to 100 percent to have no hope of being distinguishable by our generalization results. Our generalization results require a lot of test points to ensure we can determine whether the models have significantly different performances. We provide a table of the average observed AUC for each of the problems. We may use the results in the generalization section to observe that a 95 percent confidence that one model outperforms the other for a test set of 40000 points will require at least a 2 percent difference in performance.



                accuracy    auc
0 versus all    .98         .97
1 versus all    .99         .99
2 versus all    .93         .90
3 versus all    .92         .90
4 versus all    .94         .93
5 versus all    .85         .83
6 versus all    .96         .96
7 versus all    .97         .95
8 versus all    .87         .85
9 versus all    .88         .86


In this test, for the MNIST dataset, we see that accuracy outperforms AUC in many cases with a significance of 95 percent for the size of the test set. This seems to contradict later experiments, but it is important to note that the dataset is unbalanced by a factor of 1 to 9 which, as the later experiments show, should not provide that great of an improvement in the limiting case. We are also far from the limiting case here: our training set size is 7 times smaller than the dimension of the features. This means we are getting very bad approximations of our covariance matrices. The conclusion perhaps is that MNIST may not be a good dataset to benchmark this new model. We need a more difficult problem with more unbalanced data.


4.5 The C minus one mu classifier. 

If we use the square loss function as the surrogate loss function we obtain a solution we like to call the C minus one mu classifier. While other loss functions give prohibitively slow methods for AUC optimization, the square loss function provides an analytic solution that is not too difficult to compute. The square loss function replaces the indicator function with the following crude approximation:

$$\mathbb{1}_{x < 0} \approx (1 - x)^2$$

Thus we replace the problem

$$\arg\min_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{f(x_i) < f(x_j)} \tag{31}$$

with

$$\arg\min_{f \in F} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \big(1 - (f(x_i) - f(x_j))\big)^2 + \lambda^2 \|f\|^2 \tag{32}$$

If the square loss surrogate loss function is consistent with respect to AUC we may expect, in the limit as $m$ and $n$ go to infinity, the solution to 32 to be the solution to 31. So we may think of 32 as an approximation of 31. If we let $f(x)$




to be linear, so $f_w(x) = w^T x$, we get the following problem.

$$\arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \big(1 - w^T x_i + w^T x_j\big)^2 + \lambda^2 w^T w \right\} \tag{33}$$

If we solve this for $w$ we get what we call the $C^{-1}\mu$ classifier, pronounced C-minus-one-mu. The following is a derivation of the $C^{-1}\mu$ AUROC classifier. We begin by computing the gradient with respect to $w$, setting this to zero and solving for $w$. Writing $\sum_{i,j}$ for the double sum over $\{i : y_i = +1\}$ and $\{j : y_j = -1\}$,

$$\begin{aligned}
&\nabla_w \left[ \frac{1}{mn} \sum_{i,j} \big(1 - w^T x_i + w^T x_j\big)^2 + \lambda^2 w^T w \right] \\
&= \nabla_w \left[ \frac{1}{mn} \sum_{i,j} \big(1 - w^T (x_i - x_j)\big)^2 + \lambda^2 w^T w \right] \\
&= \nabla_w \left[ \frac{1}{mn} \sum_{i,j} \big(1 - 2 w^T (x_i - x_j) + w^T (x_i - x_j)(x_i - x_j)^T w\big) + \lambda^2 w^T w \right] \\
&= -\frac{2}{mn} \sum_{i,j} (x_i - x_j) + 2 \left( \lambda^2 I + \frac{1}{mn} \sum_{i,j} (x_i - x_j)(x_i - x_j)^T \right) w \\
&= 2 \left[ -\mu_+ + \mu_- + \left( \lambda^2 I + \frac{1}{mn} \sum_{i,j} \big(x_i x_i^T + x_j x_j^T - x_i x_j^T - x_j x_i^T\big) \right) w \right] \\
&= 2 \left[ -\mu_+ + \mu_- + \left( \lambda^2 I + \frac{1}{m} \sum_{\{i : y_i = +1\}} x_i x_i^T + \frac{1}{n} \sum_{\{j : y_j = -1\}} x_j x_j^T - \mu_+ \mu_-^T - \mu_- \mu_+^T \right) w \right] \\
&= 2 \left[ -\mu_+ + \mu_- + \left( \lambda^2 I + C_+ + C_- + (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T \right) w \right]
\end{aligned}$$

where $C_+ = \frac{1}{m} \sum_{\{i : y_i = +1\}} x_i x_i^T - \mu_+ \mu_+^T$ is the biased estimate of the covariance matrix of the positive instances, $C_- = \frac{1}{n} \sum_{\{j : y_j = -1\}} x_j x_j^T - \mu_- \mu_-^T$ is the biased estimate of the covariance matrix of the negative instances, $\mu_+$ is the mean of the positive instances and $\mu_-$ is the mean of the negative instances. If we define $\mu = \mu_+ - \mu_-$, set the gradient to 0 and solve for $w$, we get

$$w = \big(\lambda^2 I + C_+ + C_- + \mu \mu^T\big)^{-1} \mu$$

Assuming $C = C_+ + C_- + \lambda^2 I$ is invertible, the Sherman-Morrison-Woodbury formula gives

$$w \propto C^{-1} \mu$$

Indeed, observe that if $C$ is invertible, it is strictly positive definite and

$$\begin{aligned}
w &= \big(C + \mu \mu^T\big)^{-1} \mu \\
&= \left( C^{-1} - \frac{C^{-1} \mu \mu^T C^{-1}}{1 + \mu^T C^{-1} \mu} \right) \mu \\
&= C^{-1} \mu - k\, C^{-1} \mu \\
&= (1 - k)\, C^{-1} \mu
\end{aligned}$$

where

$$k = \frac{\mu^T C^{-1} \mu}{1 + \mu^T C^{-1} \mu}$$

Observe $0 \leq k < 1$. Thus, $w \propto C^{-1} \mu$. Also note that $C$ is invertible for any nonzero $\lambda$, where $\lambda$ is the regularization parameter.
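The derivation can be checked numerically. The sketch below, which assumes feature dimension $d \geq 2$ and uses illustrative function names, computes $C^{-1}\mu$ from the class means and biased covariances, and separately minimizes the pairwise square loss over all $mn$ differences. The two vectors should differ only by the scalar factor $1 - k = 1/(1 + \mu^T C^{-1} \mu)$.

```python
import numpy as np

def c_minus_one_mu(X_pos, X_neg, lam):
    """w = C^{-1} mu with C = C_+ + C_- + lam^2 I, built from biased covariance estimates."""
    d = X_pos.shape[1]
    mu = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    C_pos = np.cov(X_pos, rowvar=False, bias=True)
    C_neg = np.cov(X_neg, rowvar=False, bias=True)
    return np.linalg.solve(C_pos + C_neg + lam ** 2 * np.eye(d), mu)

def pairwise_square_loss_minimizer(X_pos, X_neg, lam):
    """Minimizer of (1/mn) sum_ij (1 - w^T (x_i - x_j))^2 + lam^2 w^T w over all pairs."""
    d = X_pos.shape[1]
    D = (X_pos[:, None, :] - X_neg[None, :, :]).reshape(-1, d)
    return np.linalg.solve(D.T @ D / len(D) + lam ** 2 * np.eye(d), D.mean(axis=0))

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(30, 4))
X_neg = rng.normal(0.0, 2.0, size=(50, 4))

w_c = c_minus_one_mu(X_pos, X_neg, lam=0.5)
w_pairs = pairwise_square_loss_minimizer(X_pos, X_neg, lam=0.5)
mu = X_pos.mean(axis=0) - X_neg.mean(axis=0)
print(np.allclose(w_pairs, w_c / (1.0 + mu @ w_c)))
```

The first function never forms the $mn$ pairs, which is why the square loss is so much cheaper here than a generic solver applied to the transformed dataset.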




4.6 Observations about the C minus one mu classifier 

The first observation about the C minus one mu classifier is that it is not a classifier in the strictest sense. This is because it does not provide a threshold which could be used to separate the positive and negative instances of a class. Instead it may be thought of as a projection of the input space onto a one dimensional space, chosen to maximize the probability that positive instances appear to the right of negative instances. Because of this, it is very similar in nature to linear Fisher discriminant analysis, and kernelized Fisher discriminant analysis. In fact, if we take a close look at the differences between the two we see C minus one mu gives us

$$w \propto C^{-1} \mu$$

and linear Fisher discriminant analysis gives us

$$w \propto S_w^{-1} \mu$$

where $C = \lambda^2 I + C_+ + C_-$, and $S_w$ is the within-class covariance, which is given as $S_w = \frac{m}{m+n} C_+ + \frac{n}{m+n} C_-$.

That is, $S_w$ is an unregularized sum of the same covariance matrices as $C$ but weighted so each class plays a role proportional to its frequency, whereas in the C minus one mu classifier each covariance matrix plays an equal role. If we are dealing with a balanced set and we do not regularize, the C minus one mu vector is in fact proportional to the linear discriminant analysis vector. This provides some justification to use this in place of linear discriminant analysis to obtain features that separate the data based on the two classes of data.

In the kernelized version of the C minus one mu classifier, the matrix $K_+ + K_-$ is singular and therefore the regularization term is necessary. There is an alternative procedure that may be used for regularization, which was observed to work well especially in conjunction with the $\lambda^2$ regularization: randomly removing some of the features from the positive and negative sets. This was found to provide more robust covariance estimates.
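The relationship to Fisher discriminant analysis described above is easy to confirm on synthetic data. The following sketch assumes no regularization ($\lambda = 0$), biased covariance estimates, and the frequency-weighted form $S_w = \frac{m}{m+n} C_+ + \frac{n}{m+n} C_-$; the function names are illustrative.

```python
import numpy as np

def directions(X_pos, X_neg):
    """Return the unregularized C^{-1} mu direction and the Fisher LDA direction."""
    m, n = len(X_pos), len(X_neg)
    mu = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    C_pos = np.cov(X_pos, rowvar=False, bias=True)
    C_neg = np.cov(X_neg, rowvar=False, bias=True)
    w_c = np.linalg.solve(C_pos + C_neg, mu)        # each class weighted equally
    S_w = (m * C_pos + n * C_neg) / (m + n)         # each class weighted by frequency
    w_lda = np.linalg.solve(S_w, mu)
    return w_c, w_lda

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
balanced = directions(rng.normal(1.0, 1.0, (40, 3)), rng.normal(0.0, 2.0, (40, 3)))
unbalanced = directions(rng.normal(1.0, 1.0, (10, 3)), rng.normal(0.0, 2.0, (90, 3)))
print(cosine(*balanced), cosine(*unbalanced))
```

In the balanced case $S_w$ is exactly half of $C_+ + C_-$, so the two directions coincide; in the unbalanced case they generally do not.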

4.7 C minus one mu Experimental Results: 

We begin this section by providing some explicit calculations to see if there are some underlying distributions for which the $C^{-1}\mu$ classifier will provably outperform the accuracy classifier obtained through ridge regression. Suppose we have two distributions

$$U_+ \sim N(m_+, \Sigma_+)$$

and

$$U_- \sim N(m_-, \Sigma_-)$$

and suppose $X$ is a realization of a random sample of $n$ data points that are sampled from $U_+$ with probability $\pi$ and $U_-$ with probability $1 - \pi$. Recall the $C^{-1}\mu$ classifier predicts

$$w = (C_+ + C_- + \lambda I)^{-1} (\mu_+ - \mu_-)$$

while ridge regression predicts

$$w^* = (X^T X + \lambda I)^{-1} X^T Y$$

One obvious observation to be made is that both expressions have comparable running time to evaluate. If $n_p$ is the number of positive instances and $n_n$ is the number of negative instances, we see that

$$w^* = (X^T X + \lambda I)^{-1} (n_p \mu_+ - n_n \mu_-)$$

Let's compare $AUC_w = P(w^T (U_+ - U_-) > 0)$ to $AUC_{w^*} = P(w^{*T} (U_+ - U_-) > 0)$, the AUC of the model parameterized by the AUC solution and the AUC of the model parameterized by the accuracy solution. Observe that, for independent draws,

$$U_+ - U_- \sim N(m_+ - m_-, \Sigma_+ + \Sigma_-)$$

So

$$w^T (U_+ - U_-) \sim N\big(w^T (m_+ - m_-),\; w^T (\Sigma_+ + \Sigma_-) w\big)$$

and

$$w^{*T} (U_+ - U_-) \sim N\big(w^{*T} (m_+ - m_-),\; w^{*T} (\Sigma_+ + \Sigma_-) w^*\big)$$

Let $U$ be the random variable that is sampled from the $U_+$ distribution with proportion $\pi$ and from the $U_-$ distribution with proportion $1 - \pi$, so that the rows of $X$ are realizations of $U$. Further simplification of $E[U U^T]$ provides us the following:

$$\begin{aligned}
E[U U^T] &= E\big[E[U U^T \mid y]\big] = \pi E[U_+ U_+^T] + (1 - \pi) E[U_- U_-^T] \\
&= \pi \big(\Sigma_+ + E[U \mid y = +1] E[U \mid y = +1]^T\big) + (1 - \pi) \big(\Sigma_- + E[U \mid y = -1] E[U \mid y = -1]^T\big) \\
&= \pi \Sigma_+ + (1 - \pi) \Sigma_- + \pi m_+ m_+^T + (1 - \pi) m_- m_-^T
\end{aligned}$$



Therefore, after dividing through by the sample size and ignoring the regularization term, $w^*$ is an empirical estimate of

$$w^* \approx \big(\pi \Sigma_+ + (1 - \pi) \Sigma_- + \pi m_+ m_+^T + (1 - \pi) m_- m_-^T\big)^{-1} \big(\pi m_+ - (1 - \pi) m_-\big)$$

On the other hand, $w$ is an empirical estimate of

$$w \approx (\Sigma_+ + \Sigma_-)^{-1} (m_+ - m_-)$$

Now the AUC of $w$ and the AUC of $w^*$ are given by $P(w^T (U_+ - U_-) > 0)$ and $P(w^{*T} (U_+ - U_-) > 0)$ respectively.

$$\begin{aligned}
P\big(w^T (U_+ - U_-) > 0\big) &= P\big(w^T (U_+ - U_-) - w^T (m_+ - m_-) > -w^T (m_+ - m_-)\big) \\
&= P\left( \frac{w^T (U_+ - U_-) - w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}} > \frac{-w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}} \right)
\end{aligned}$$

Similarly,

$$P\big(w^{*T} (U_+ - U_-) > 0\big) = P\left( \frac{w^{*T} (U_+ - U_-) - w^{*T} (m_+ - m_-)}{\sqrt{w^{*T} (\Sigma_+ + \Sigma_-) w^*}} > \frac{-w^{*T} (m_+ - m_-)}{\sqrt{w^{*T} (\Sigma_+ + \Sigma_-) w^*}} \right)$$

If we recall that $U_+$ and $U_-$ are normally distributed, so $U_+ - U_-$ is normally distributed with a mean of $m_+ - m_-$ and a covariance of $\Sigma_+ + \Sigma_-$, we see that the random variable

$$Z = \frac{w^T (U_+ - U_-) - w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}}$$

has a standard normal distribution, $Z \sim N(0, 1)$. Moreover,

$$Z^* = \frac{w^{*T} (U_+ - U_-) - w^{*T} (m_+ - m_-)}{\sqrt{w^{*T} (\Sigma_+ + \Sigma_-) w^*}}$$

is also a standard normal random variable. Thus

$$auc(w) = CDF\left( \frac{w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}} \right)$$

and

$$auc(w^*) = CDF\left( \frac{w^{*T} (m_+ - m_-)}{\sqrt{w^{*T} (\Sigma_+ + \Sigma_-) w^*}} \right)$$

where CDF is the cumulative distribution function of the standard normal distribution, which is monotone increasing. Therefore, maximizing the AUC amounts to maximizing

$$\frac{w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}}$$

A little bit of matrix calculus will reveal that the AUC solution for $w$ is in fact the unique maximizer, up to positive scaling, of this ratio. Indeed, maximizing

$$\frac{w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}}$$

with respect to $w$ is equivalent to maximizing

$$w^T (m_+ - m_-)$$

subject to the constraint $w^T (\Sigma_+ + \Sigma_-) w = 1$. We may use the method of Lagrange multipliers to consider the necessary condition that the solution to

$$\max_w \; w^T (m_+ - m_-)$$

subject to $w^T (\Sigma_+ + \Sigma_-) w - 1 = 0$ is a critical point of the problem

$$w^T (m_+ - m_-) + \lambda \big(w^T (\Sigma_+ + \Sigma_-) w - 1\big)$$

Assuming $\Sigma_+ + \Sigma_-$ is positive definite, we get a unique critical direction at $w \propto (\Sigma_+ + \Sigma_-)^{-1} (m_+ - m_-)$. This tells us that the $C^{-1}\mu$ classifier is asymptotically the best performing linear classifier, assuming the negative and positive classes are normally distributed.
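The population formulas above translate directly into a few lines of NumPy. The sketch below evaluates the Gaussian AUC $CDF\big(w^T(m_+ - m_-)/\sqrt{w^T(\Sigma_+ + \Sigma_-)w}\big)$ for the AUC direction $(\Sigma_+ + \Sigma_-)^{-1}(m_+ - m_-)$ and for the population ridge direction at several class proportions $\pi$. The random covariance matrices, the dimension, and all function names are assumptions made for the illustration.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gaussian_auc(w, m_pos, m_neg, S_pos, S_neg):
    """AUC of the linear score w^T x when both classes are Gaussian."""
    return normal_cdf((w @ (m_pos - m_neg)) / math.sqrt(w @ (S_pos + S_neg) @ w))

def auc_direction(m_pos, m_neg, S_pos, S_neg):
    return np.linalg.solve(S_pos + S_neg, m_pos - m_neg)

def ridge_direction(pi, m_pos, m_neg, S_pos, S_neg):
    """Population limit of ridge regression with labels +1 / -1, regularizer ignored."""
    second_moment = (pi * (S_pos + np.outer(m_pos, m_pos))
                     + (1.0 - pi) * (S_neg + np.outer(m_neg, m_neg)))
    return np.linalg.solve(second_moment, pi * m_pos - (1.0 - pi) * m_neg)

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
S_pos, S_neg = A @ A.T + np.eye(5), B @ B.T + np.eye(5)   # random positive definite matrices
m_pos, m_neg = rng.normal(size=5), rng.normal(size=5)

best = gaussian_auc(auc_direction(m_pos, m_neg, S_pos, S_neg), m_pos, m_neg, S_pos, S_neg)
for pi in (0.5, 0.25, 0.1, 0.01):
    w_star = ridge_direction(pi, m_pos, m_neg, S_pos, S_neg)
    print(pi, best, gaussian_auc(w_star, m_pos, m_neg, S_pos, S_neg))
```

By the Cauchy-Schwarz inequality the AUC direction can never score below the ridge direction in this Gaussian model, whatever the value of $\pi$.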


In the following table we provide some examples of the auc scores expected from ridge regression classifiers and auc 
classifiers assuming the negative and positive classes are each sampled from Gaussian distributions of 50 variables 
with random covariance matrices. 


As an example, suppose

$$\Sigma_+ = \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix}, \qquad \Sigma_- = \begin{pmatrix} \cdots & -1 \\ \cdots & 5 \end{pmatrix}, \qquad m_+ = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad m_- = \cdots$$

For the $C^{-1}\mu$ direction $w$ we obtain

$$CDF\left( \frac{w^T (m_+ - m_-)}{\sqrt{w^T (\Sigma_+ + \Sigma_-) w}} \right) = CDF\left( \sqrt{\frac{44}{35}} \right) \approx CDF(1.12122) \approx .869$$

while for the ridge regression direction $w^*$, the value of

$$CDF\left( \frac{w^{*T} (m_+ - m_-)}{\sqrt{w^{*T} (\Sigma_+ + \Sigma_-) w^*}} \right)$$

for the seven values of $\pi$ considered is, in order,

CDF(1.109) ≈ .866
CDF(1.121) ≈ .869
CDF(1.09) ≈ .862
CDF(1.118) ≈ .868
CDF(1.08) ≈ .866
CDF(1.110) ≈ .867
CDF(1.074) ≈ .859




A classifier designed to maximize the AUC should be independent of the ratio of positive instances to negative instances, and its AUC score should be greater than or equal to that of the ridge regression classifier. In practice, the $C^{-1}\mu$ classifier does not perform as well as we might hope. This may be because the $C$ matrix is ill conditioned. To combat this, we will want our dataset to be large enough so the total number of data points in each of our classes exceeds the dimension of our data. This will ensure more robust covariance matrix estimates for the AUC classifier. Also, we may observe that the difference in performance of the two methods is at most 1 percent, and this is for a relatively unbalanced dataset. Thus, if we expect to see any significant difference in performance, we will likely need to consider very large test sets and very unbalanced datasets. Unfortunately, very unbalanced datasets need a lot more training and test data to make the covariance matrices reasonable estimations, and to ensure enough test points for statistical significance.



                   auc n = 50   acc n = 50   auc n = 2   acc n = 2
π = .5             .877         .877         .818        .818
π = .25            .866         .850         .833        .826
π = .125           .874         .827         .866        .857
π = .0625          .878         .797         .865        .836
π = .03125         .875         .757         .823        .787
π = .015625        .878         .715         .853        .817
π = .0078125       .881         .682         .855        .830
π = .00390625      .865         .651         .870        .833
π = .001953125     .867         .645         .826        .791


Here $n$ is the number of features, and the scores were taken by averaging over 20 randomly generated covariance matrices for the two classes. From the table above we can see that for $\pi = .5$ there is no observable difference in AUC, and in general, as the problem becomes more unbalanced, the AUC classifier outperforms the accuracy classifier. This performance increase is most evident in the high dimensional situation, where we see over 20% improvement over ridge regression. On the other hand, with low dimensional data we only see about a 3% improvement. It is important to observe that even a 10 percent to 90 percent imbalance of the data with 50 features in the best case will only provide something like 4% improvement in the limiting case. A very significant imbalance is needed to see substantial benefits of the AUC classifier over the accuracy classifier.


5 Kernelizing the AUROC classifier: 


Classification models with linear decision boundaries may be used to solve classification problems with non linear decision boundaries by performing a nonlinear transformation of the feature space. Observe that the procedure outlined in the section Method Of Surrogate Loss Functions applies equally well if we let $S \subset H$ where $H$ is any Hilbert space of our choosing. We simply replace $w^T x$ with $\langle w, x \rangle$ where $w, x \in H$, and $\langle \cdot, \cdot \rangle$ is the inner product in $H$. $H$ may be finite or infinite dimensional, but usually has a countable basis. If we let $S$ be an arbitrary sample space and $\eta : S \to H$ be a function taking elements of $S$ to elements of a Hilbert space $H$, then we can evaluate the empirical risk minimization problem

$$\arg\min_{h \in H} \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \mathbb{1}_{\langle h, \eta(x_i) \rangle < \langle h, \eta(x_j) \rangle} \tag{34}$$

analogous to the equation 21. That is, we can apply the techniques briefly outlined in the section Method of Surrogate Loss Functions to find an approximate solution to equation 34. From there, we can use properties of the Hilbert space $H$, and the fact that we used a regularization term $\lambda$ in our surrogate loss procedure, to get upper bounds on the rank shattering dimension of our family of feasible solutions. Observe that the KKT conditions imply that we solve a constrained minimization problem in $H$, so our set of feasible solutions is smaller than $H$. Often, $H$ is finite dimensional, so we may use the fact that the shattering dimension of $\mathbb{R}^d$ is $d$ to get a bound on the rank shattering dimension of $H$. This will give us uniform convergence guarantees. Occasionally, $H$ is infinite dimensional. In such cases, the regularization parameter $\lambda$ still provides us with small enough rank shattering coefficients to give us useful uniform consistency bounds.


For example, suppose our training set is $\mathcal{D} = \{-6, -5, -4, -2, -1, 0, 1, 2, 4, 5, 6\}$ and the corresponding class labels are $Y = \{0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0\}$. The data set together with two classifiers for this dataset are presented in figures 2 and 3. Although this dataset is easy to classify by inspection, there is no linear classifier in $\mathbb{R}$ that provides an adequate classification. This is because a linear classifier in $\mathbb{R}$ reduces to

$$c(x, \theta) = \begin{cases} 1 & \text{if } x > \theta \\ 0 & \text{if } x \leq \theta \end{cases} \tag{35}$$

or

$$c(x, \theta) = \begin{cases} 0 & \text{if } x > \theta \\ 1 & \text{if } x \leq \theta \end{cases} \tag{36}$$

Figure 2: This is the best attainable linear classifier for this dataset. This classifier misclassifies the 3 rightmost points. This classifier is given by setting $\theta = -3$ in the classifier 35.

Instead we want to use a classifier of the form

$$c(x, \theta_1, \theta_2) = \begin{cases} 1 & \text{if } \theta_1 < x < \theta_2 \\ 0 & \text{otherwise} \end{cases} \tag{37}$$

Consider what happens if we define $\eta(x) = x^2$. Then $\eta : \mathcal{D} \to \mathcal{D}'$ where, after sorting, $\mathcal{D}' = \{0, 1, 1, 4, 4, 16, 16, 25, 25, 36, 36\}$ and the labels in the same order are $Y = \{1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0\}$. Under this nonlinear transformation, the dataset $\mathcal{D}'$ is linearly separable. In fact, a linear classifier in the image of $\eta$ is precisely a classifier of the form 37, where $\theta_1 = -\theta_2$.
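The example can be verified with a few lines of plain Python. The sketch below searches over every one dimensional threshold classifier, in both orientations, and counts mistakes on the raw data and on the data after the map $x \mapsto x^2$. The function name and the choice of candidate thresholds (midpoints between neighboring values) are illustrative.

```python
D = [-6, -5, -4, -2, -1, 0, 1, 2, 4, 5, 6]
Y = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

def best_threshold_errors(xs, ys):
    """Fewest mistakes made by any 1-D threshold classifier, in either orientation."""
    pts = sorted(set(xs))
    candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    best = len(xs)
    for theta in candidates:
        # Orientation "predict 1 when x > theta".
        err_up = sum((x > theta) != bool(y) for x, y in zip(xs, ys))
        # The reversed orientation flips every prediction.
        err_down = len(xs) - err_up
        best = min(best, err_up, err_down)
    return best

print(best_threshold_errors(D, Y))                    # raw feature
print(best_threshold_errors([x * x for x in D], Y))   # after eta(x) = x^2
```

On the raw feature no threshold does better than three mistakes, while after squaring a single threshold between 4 and 16 separates the classes.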


5.1 Moore-Aronsajn Theorem and the Kernel Trick: 

It is not always necessary to know $\eta$ explicitly. It is often enough to simply know there exists a transformation from $S$
to a Hilbert space $\mathbb{H}$ and to know how to compute inner products in $\mathbb{H}$.

Definition: A function $k : S \times S \to \mathbb{R}$ is said to be positive semidefinite if for every $1 \le n \in \mathbb{N}$, for every
$\{a_1, \ldots, a_n\} \subset \mathbb{R}$ and for every $\{x_1, \ldots, x_n\} \subset S$

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \ge 0$$

Definition: A function $k : S \times S \to \mathbb{R}$ is said to be symmetric if $k(x, y) = k(y, x)$ for all $x, y \in S$.

Figure 3: This classifier may be the best classifier for this dataset but it is not linear. This classifier is given by setting
$\theta_1 = -3$ and $\theta_2 = 3$ in the classifier 37. We may arrive at this classifier by letting $\eta(x) = x^2$ and using the classifier
36 with $\theta = 9$ on $\mathcal{D}'$.

Suppose there exists a Hilbert space $\mathbb{H}$ over the field of real numbers and a function $\eta : S \to \mathbb{H}$. Then for any
$x, y \in S$, $\langle \eta(x), \eta(y) \rangle_{\mathbb{H}} = \langle \eta(y), \eta(x) \rangle_{\mathbb{H}}$ by properties of inner products over the real numbers. Thus, for
$k(x, y) = \langle \eta(x), \eta(y) \rangle$, $k$ must be symmetric. Moreover, for any $1 \le n \in \mathbb{N}$, for any $\{x_1, x_2, \ldots, x_n\} \subset S$ and for any
$\{a_1, \ldots, a_n\} \subset \mathbb{R}$,

$$\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \langle \eta(x_i), \eta(x_j) \rangle &= \sum_{i=1}^{n} a_i \sum_{j=1}^{n} a_j \langle \eta(x_i), \eta(x_j) \rangle \\
&= \sum_{i=1}^{n} a_i \left\langle \eta(x_i), \sum_{j=1}^{n} a_j \eta(x_j) \right\rangle \\
&= \left\langle \sum_{i=1}^{n} a_i \eta(x_i), \sum_{j=1}^{n} a_j \eta(x_j) \right\rangle
\end{aligned}$$

If we let $s = \sum_{i=1}^{n} a_i \eta(x_i)$ we see that we get

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \langle \eta(x_i), \eta(x_j) \rangle = \langle s, s \rangle \ge 0$$

by properties of inner products. Thus, for $k(x, y)$ to correspond to some inner product in some Hilbert space, it is
necessary for $k$ to be both symmetric and positive semidefinite.
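The two necessary conditions can be tested numerically on a finite set of points by inspecting the Gram matrix. The sketch below is illustrative only: `gram_matrix` and `is_symmetric_psd` are hypothetical helper names, the tolerance is an arbitrary choice, and passing the check on one point set does not prove that a function is positive semidefinite on all of $S$.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Matrix with entries k(x_i, x_j) for the points xs."""
    return np.array([[kernel(xi, xj) for xj in xs] for xi in xs], dtype=float)

def is_symmetric_psd(G, tol=1e-9):
    """Check symmetry, and a^T G a >= 0 for all a via the smallest eigenvalue of G."""
    if not np.allclose(G, G.T):
        return False
    smallest = np.linalg.eigvalsh(G).min()
    return bool(smallest >= -tol * max(1.0, np.abs(G).max()))

xs = np.array([-6, -5, -4, -2, -1, 0, 1, 2, 4, 5, 6], dtype=float)

def k_feature(x, y):
    """k(x, y) = <eta(x), eta(y)> with eta(x) = x^2; the inner product is a product."""
    return (x ** 2) * (y ** 2)

def k_bad(x, y):
    """Symmetric, but its Gram matrix has zero trace, so it cannot be an inner product."""
    return -abs(x - y)

ok_feature = is_symmetric_psd(gram_matrix(k_feature, xs))
ok_bad = is_symmetric_psd(gram_matrix(k_bad, xs))
print(ok_feature, ok_bad)
```

The second function is symmetric but fails the eigenvalue test, so by the argument above it cannot be written as $\langle \eta(x), \eta(y) \rangle$ for any feature map.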

Definition: Given a Hilbert space $\mathbb{H}$ of real valued functions on a nonempty set $X$, and a function $k : X \times X \to \mathbb{R}$,
$k$ is called a reproducing kernel for $\mathbb{H}$ and $\mathbb{H}$ is called a reproducing kernel Hilbert space (RKHS) if the following
properties hold.


1. $\forall x \in X$, $k(\cdot, x) \in \mathbb{H}$

2. $\forall f \in \mathbb{H}$ and $\forall x \in X$, $\langle f, k(\cdot, x) \rangle = f(x)$

The important facts to emphasize are that $\mathbb{H}$ is a Hilbert space of functions from $X$ to $\mathbb{R}$, and that $k(\cdot, x)$, for any
$x \in X$, is one such function that happens to be parameterized by $x$. The evaluation of a function $f$ at a point $x$ is
computed by taking the inner product in $\mathbb{H}$ between the function $f$ and the function in $\mathbb{H}$ parameterized by $x$, that is,
$f(x) = \langle f, k(\cdot, x) \rangle$. For instance, $k(\cdot, x)$ and $k(\cdot, y)$ are both functions from $X$ to $\mathbb{R}$. They are both elements of $\mathbb{H}$ and

$$k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle$$

The Moore-Aronszajn Theorem provides us with a very useful statement: if $k(x, y)$ is positive semidefinite
and symmetric, then there exists a unique RKHS with reproducing kernel $k$.

Therefore, given any positive semidefinite and symmetric function $k$, we may rest assured that there is an $\eta$ and an $\mathbb{H}$
for which $k(x, y) = \langle \eta(x), \eta(y) \rangle$; moreover, $\mathbb{H}$ is an RKHS.

Let's take another look at 34. The Moore-Aronszajn Theorem tells us that for any positive semidefinite symmetric function $k$, we may
consider the empirical risk minimization problem


$$\arg\min_{h \in \mathbb{H}} \frac{1}{mn} \sum_{i : y_i = +1} \sum_{j : y_j = -1} I_{\langle h, k(\cdot, x_i) \rangle > \langle h, k(\cdot, x_j) \rangle} \tag{38}$$


where $\mathbb{H}$ is the RKHS that corresponds to $k$. In practice, we cannot minimize over all $h \in \mathbb{H}$ because we do not have
any convenient representations of elements of $\mathbb{H}$. There are no explicit $\eta$ functions. However, recall that for every $x \in S$,
$k(\cdot, x) \in \mathbb{H}$, and since $\mathbb{H}$ is a vector space, $\mathrm{span}\{k(\cdot, x_1), \ldots, k(\cdot, x_n)\} \subset \mathbb{H}$. In fact, $\mathrm{span}\{k(\cdot, x_1), \ldots, k(\cdot, x_n)\}$ is
itself a finite dimensional Hilbert space. We can certainly evaluate


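$$\begin{aligned}
&\arg\min_{h \in \mathrm{span}\{k(\cdot, x_1), \ldots, k(\cdot, x_n)\}} \frac{1}{mn} \sum_{i : y_i = +1} \sum_{j : y_j = -1} I_{\langle h, k(\cdot, x_i) \rangle > \langle h, k(\cdot, x_j) \rangle} \\
&\quad = \arg\min_{\{a_1, \ldots, a_n\} \subset \mathbb{R}} \frac{1}{mn} \sum_{i : y_i = +1} \sum_{j : y_j = -1} I_{\left\langle \sum_{l=1}^{n} a_l k(\cdot, x_l), k(\cdot, x_i) \right\rangle > \left\langle \sum_{l=1}^{n} a_l k(\cdot, x_l), k(\cdot, x_j) \right\rangle} \\
&\quad = \arg\min_{\{a_1, \ldots, a_n\} \subset \mathbb{R}} \frac{1}{mn} \sum_{i : y_i = +1} \sum_{j : y_j = -1} I_{\sum_{l=1}^{n} a_l k(x_l, x_i) > \sum_{l=1}^{n} a_l k(x_l, x_j)}
\end{aligned}$$

where this last step comes from the reproducing property of the RKHS $\mathbb{H}$.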
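The last line shows that the objective depends on the data only through the Gram matrix. As a minimal sketch of how the double sum could be evaluated for a given coefficient vector, assuming an RBF kernel (an arbitrary choice, not specified in the text) and hypothetical helper names `rbf_gram` and `pairwise_indicator_objective`:

```python
import numpy as np

def rbf_gram(xs, gamma=1.0):
    """Gram matrix K[l, i] = exp(-gamma * (x_l - x_i)^2) for 1-D inputs."""
    diff = xs[:, None] - xs[None, :]
    return np.exp(-gamma * diff ** 2)

def pairwise_indicator_objective(a, K, y):
    """(1/mn) * sum over (positive i, negative j) of I{h(x_i) > h(x_j)},
    where h = sum_l a_l k(., x_l), so that h(x_i) = sum_l a_l K[l, i]."""
    scores = K.T @ a                       # reproducing property: h(x_i) = <h, k(., x_i)>
    pos, neg = scores[y == 1], scores[y == -1]
    return float(np.mean(pos[:, None] > neg[None, :]))

xs = np.array([-6, -5, -4, -2, -1, 0, 1, 2, 4, 5, 6], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1, 1, -1, -1, -1])
K = rbf_gram(xs)

a = np.where(y == 1, 1.0, -1.0)            # hypothetical coefficients, not an optimum
value = pairwise_indicator_objective(a, K, y)
value_flipped = pairwise_indicator_objective(-a, K, y)
print(value, value_flipped)
```

No element of $\mathbb{H}$ is ever represented explicitly: the scores are obtained from $K$ and the coefficients alone.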
The next step is to solve the surrogate problem 


$$\arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{mn} \sum \sum \phi(y_i \alpha^T K_i) + \lambda \alpha^T K \alpha \tag{39}$$


where $K_{ij} = k(x_i, x_j)$ and $K_i$ is the $i$th column of $K$. Using the Moore-Aronszajn theorem, it is not difficult to prove
that the solution to 39 is also the solution to


$$\arg\min_{h \in \mathbb{H}} \left\{ \frac{1}{mn} \sum \sum \phi(y_i \langle h, k(\cdot, x_i) \rangle) + \lambda \langle h, h \rangle \right\} \tag{40}$$

This provides us with a guarantee that we may solve 38 by considering 39 in the limit as the number of training
points goes to infinity.

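By defining $z_i = K^{-1/2} K_i$ and $\beta = K^{1/2} \alpha$, so that $\alpha^T K_i = \beta^T z_i$ and $\alpha^T K \alpha = \beta^T \beta$, problem 39 may be solved as the linear problem

$$\arg\min_{\beta \in \mathbb{R}^n} \frac{1}{mn} \sum \sum \phi(y_i \beta^T z_i) + \lambda \beta^T \beta$$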


This problem can then be solved as described in the Methods section.
For the kernelized C minus one mu classifier we get the solution:

$$\beta = (C_+ + C_- + \lambda I)^{-1} \mu \tag{41}$$




where $C_+$, $C_-$, and $\mu$ are computed using the training set of $z_i = K^{-1/2} K_i$ vectors. Solving for $\alpha$ in terms of $K$
we see

$$\alpha = K^{-1/2} \left( K^{-1/2} C^*_+ K^{-1/2} + K^{-1/2} C^*_- K^{-1/2} + \lambda I \right)^{-1} K^{-1/2} \mu^* = (C^*_+ + C^*_- + \lambda K)^{-1} \mu^*$$

where

$$\mu^* = \sum_{i : y_i = +1} K_i - \sum_{j : y_j = -1} K_j$$

$$C^*_+ = \mathrm{Cov}\{K_+\},$$

$$C^*_- = \mathrm{Cov}\{K_-\},$$

$K_+$ is the set of vectors $K_i$ such that $y_i = +1$, $K_-$ is the set of vectors $K_j$ such that $y_j = -1$, and $\mathrm{Cov}\{X\}$ is the
covariance matrix for the vectors $X$. It is important to note that $C^*_+ + C^*_-$ will always be rank deficient by at least 1, so for
this algorithm, $\lambda$ must be chosen strictly positive.
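The agreement between the two expressions for $\alpha$ can be checked numerically. The sketch below makes several assumptions that are not in the text: an RBF kernel, $\lambda = 0.1$, the default (unbiased) normalization of `np.cov`, and a hypothetical helper `class_stats`. The vector $\mu$ is taken as a difference of class sums; the agreement of the two routes does not depend on how $\mu$ is normalized, since $\mu = K^{-1/2} \mu^*$ either way.

```python
import numpy as np

xs = np.array([-6, -5, -4, -2, -1, 0, 1, 2, 4, 5, 6], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1, 1, -1, -1, -1])
lam = 0.1                                   # regularization parameter, must be > 0

# Gram matrix of an RBF kernel (an assumed choice of k); column i is K_i.
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2)

def class_stats(V, y):
    """Covariances of the positive/negative columns of V and the difference of their sums."""
    pos, neg = V[:, y == 1], V[:, y == -1]
    mu = pos.sum(axis=1) - neg.sum(axis=1)
    return np.cov(pos), np.cov(neg), mu

# Route 1: alpha = (C*_+ + C*_- + lam K)^{-1} mu*, computed from the columns K_i.
C_pos_star, C_neg_star, mu_star = class_stats(K, y)
alpha_direct = np.linalg.solve(C_pos_star + C_neg_star + lam * K, mu_star)

# Route 2: beta from equation (41) on z_i = K^{-1/2} K_i, then alpha = K^{-1/2} beta.
w, U = np.linalg.eigh(K)                    # K is symmetric positive definite here
K_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
Z = K_inv_sqrt @ K                          # column i is z_i
C_pos, C_neg, mu = class_stats(Z, y)
beta = np.linalg.solve(C_pos + C_neg + lam * np.eye(len(xs)), mu)
alpha_via_beta = K_inv_sqrt @ beta

# Without the regularization term the matrix to invert is singular.
rank_without_reg = np.linalg.matrix_rank(C_pos_star + C_neg_star)
print(np.allclose(alpha_direct, alpha_via_beta), rank_without_reg, len(xs))
```

Route 1 avoids forming $K^{-1/2}$ altogether, which is the practical benefit of solving for $\alpha$ directly.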


6 Appendix A: 


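Let us consider the subtle difference in the definitions:

$$A(f, T) = \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} I_{\{f(x_i) > f(x_j)\}}$$

$$A'(f, T) = \frac{1}{mn} \sum_{\{i : y_i = +1\}} \sum_{\{j : y_j = -1\}} \left[ I_{\{f(x_i) > f(x_j)\}} + \frac{1}{2} I_{\{f(x_i) = f(x_j)\}} \right]$$

$$A(f) = E_{X \sim D_{+1}, X' \sim D_{-1}} \left\{ I_{\{f(X) > f(X')\}} \right\}$$

$$A'(f) = E_{X \sim D_{+1}, X' \sim D_{-1}} \left\{ I_{\{f(X) > f(X')\}} + \frac{1}{2} I_{\{f(X) = f(X')\}} \right\}$$

more rigorously.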

Let

$$A'(f, T) - A(f, T) = \omega_t$$

where $\omega_t$ is nonnegative, typically very small, and equal to $\frac{p}{2}$, where $p$ is the proportion of the training data point pairs $(x_i, x_j)$,
with $i$ such that $y_i = 1$ and $j$ such that $y_j = -1$, for which $f(x_i) = f(x_j)$. Also, let

$$\omega = A'(f) - A(f)$$

be half the unknown proportion of point pairs in the underlying distribution for which $f(x) = f(x')$. Notice that $\omega_t \approx \omega$
in the sense that $\omega_t$ is the empirical approximation of $\omega$. Let us evaluate

$$P_{T_X \mid y = y'}\{|A(f, T) - A(f)| > \epsilon\}$$

where we assume $|\omega_t - \omega| < \epsilon$:

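\begin{align}
P_{T_X \mid y = y'}\{|A(f, T) - A(f)| > \epsilon\} &= P_{T_X \mid y = y'}\{|A(f, T) - A'(f, T) + A'(f, T) - A'(f) + A'(f) - A(f)| > \epsilon\} \tag{42} \\
&\le P_{T_X \mid y = y'}\{|A(f, T) - A'(f, T) + A'(f) - A(f)| + |A'(f, T) - A'(f)| > \epsilon\} \tag{43} \\
&= P_{T_X \mid y = y'}\{|A'(f, T) - A'(f)| + |\omega - \omega_t| > \epsilon\} \tag{44} \\
&= P_{T_X \mid y = y'}\{|A'(f, T) - A'(f)| > \epsilon - |\omega - \omega_t| > 0\} \le 2e^{-2\rho(1 - \rho)N(\epsilon - |\omega - \omega_t|)^2} \tag{45}
\end{align}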


This tells us that, so long as the empirical estimate of the proportion of data points for which $f(x) = f(x')$ is an
accurate one, we get a similar generalization bound as that given by [1]. In all our experiments, that empirical estimate
is 0, as can be expected in all experiments with no repeats in input data. For the entirety of this paper, we assume that
$\omega_t = \omega$ and thus we apply equation 12 as the generalization bound to both definitions of AUROC.
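The gap between the two empirical definitions is easy to compute on a small sample. The following sketch uses made-up scores and a hypothetical helper `auroc_pair`; it only illustrates that the difference equals half the proportion of tied (positive, negative) pairs and vanishes when no scores are tied.

```python
import numpy as np

def auroc_pair(scores, y):
    """Return (A, A_prime): A counts strict wins only, A_prime adds 1/2 per tied pair."""
    pos, neg = scores[y == 1], scores[y == -1]
    wins = np.mean(pos[:, None] > neg[None, :])
    ties = np.mean(pos[:, None] == neg[None, :])   # proportion p of tied pairs
    return float(wins), float(wins + 0.5 * ties)

y = np.array([1, 1, 1, -1, -1])
scores_ties = np.array([0.9, 0.5, 0.5, 0.5, 0.1])      # hypothetical scores with ties
scores_distinct = np.array([0.9, 0.6, 0.4, 0.5, 0.1])  # hypothetical scores, no ties

A, A_prime = auroc_pair(scores_ties, y)
omega_t = A_prime - A                                   # equals p / 2
A_d, A_prime_d = auroc_pair(scores_distinct, y)
print(A, A_prime, omega_t, A_d, A_prime_d)
```

In the first sample two of the six pairs are tied, so $p = 1/3$ and $\omega_t = 1/6$; in the second sample the two definitions coincide.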




7 Appendix B 


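Theorem 5 Let $f : \mathcal{X} \to \mathbb{R}$ be a fixed ranking function on $\mathcal{X}$ and let $\rho(T_y)$ be the proportion of positive instances in
the test set. Then for any $0 < \delta \le 1$,

$$P\left\{ |A(f; T) - A(f)| > \sqrt{\frac{\ln(2/\delta)}{2\rho(T_y)(1 - \rho(T_y))N}} \right\} \le \delta \tag{46}$$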


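Proof: Given the sequence of labels, $y'$, the random variables $X_1, \ldots, X_N$ are independent, with each $X_k$ taking
values in $\mathcal{X}$. Let's define a function $\phi : \mathcal{X}^N \to \mathbb{R}$ as follows:

$$\phi(x_1, \ldots, x_N) = A(f; ((x_1, y_1), \ldots, (x_N, y_N))).$$

Then for each $k$ such that $y_k = 1$ we have for all $x_i, x'_k \in \mathcal{X}$:

$$\begin{aligned}
|\phi(x_1, \ldots, x_N) - \phi(x_1, \ldots, x_{k-1}, x'_k, x_{k+1}, \ldots, x_N)| &= \frac{1}{mn} \left| \sum_{j : y_j = -1} \left( I_{\{f(x_k) > f(x_j)\}} - I_{\{f(x'_k) > f(x_j)\}} \right) \right| \\
&\le \frac{1}{mn} \cdot n = \frac{1}{m}
\end{aligned}$$

Similarly, for each $k$ such that $y_k = -1$ we have for all $x_i, x'_k \in \mathcal{X}$,

$$|\phi(x_1, \ldots, x_N) - \phi(x_1, \ldots, x_{k-1}, x'_k, x_{k+1}, \ldots, x_N)| \le \frac{1}{n}$$

Therefore, McDiarmid's inequality tells us,

$$P_{T_X \mid y = y'}\{|A(f; T) - A(f)| > \epsilon\} \le 2e^{-2\epsilon^2 / \sum_{k=1}^{N} c_k^2}$$

where $c_k = \frac{1}{m}$ if $y_k = 1$ and $c_k = \frac{1}{n}$ if $y_k = -1$. Now,

$$\sum_{k=1}^{N} c_k^2 = \frac{1}{m} + \frac{1}{n} = \frac{m + n}{mn}$$

so that, with $\rho = m/N$ and $N = m + n$,

$$P_{T_X \mid y = y'}\{|A(f; T) - A(f)| > \epsilon\} \le 2e^{-2\epsilon^2 mn/(m + n)} = 2e^{-2\epsilon^2 \rho(1 - \rho)N} \tag{47}$$

The rest of the argument is the same as presented for $A'(f, T)$ and $A'(f)$.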
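To get a feel for the bound, the deviation at a given confidence level can be computed directly from $m$ and $n$. This is a small sketch with hypothetical function names and sample sizes chosen only for illustration; it evaluates the right-hand side of (47) and the corresponding deviation in (46).

```python
import math

def deviation_bound(epsilon, m, n):
    """Right-hand side of (47): 2 * exp(-2 * eps^2 * m * n / (m + n))."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * m * n / (m + n))

def epsilon_for_confidence(delta, m, n):
    """Deviation eps at which the bound in (47) equals delta, as in (46)."""
    N = m + n
    rho = m / N                      # proportion of positive instances
    return math.sqrt(math.log(2.0 / delta) / (2.0 * rho * (1.0 - rho) * N))

# Same total sample size, different class balance (illustrative numbers).
balanced = epsilon_for_confidence(0.05, m=500, n=500)
unbalanced = epsilon_for_confidence(0.05, m=10, n=990)
print(balanced, unbalanced)
```

With the same total of 1000 instances, the guaranteed deviation is about five times larger when only 1% of the instances are positive, which reflects the factor $\rho(1 - \rho)$ in the exponent.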


References 

[1] Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, and Dan Roth. Generalization bounds for the
area under the ROC curve. Journal of Machine Learning Research, 6(Apr):393-425, 2005.

[2] Vladimir N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks,
10(5), 1999.

[3] John Duchi, Lester Mackey, and Michael Jordan. On the consistency of ranking algorithms. Proceedings of the
27th International Conference on Machine Learning, Haifa, Israel, 2010.

[4] Wei Gao and Zhi-Hua Zhou. On the consistency of AUC pairwise optimization. Proceedings of the Twenty-Fourth
International Joint Conference on Artificial Intelligence, 2015.

[5] Alain Rakotomamonjy. Optimizing area under ROC curve with SVMs. In ROCAI, 2004.

[6] Yiming Ying and Ding-Xuan Zhou. Online pairwise learning algorithms. Neural Comput., 28(4):743-777, April
2016.

[7] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd
International Conference on Machine Learning, ICML '05, pages 377-384, New York, NY, USA, 2005. ACM.

[8] Hiva Ghanbari and Katya Scheinberg. Directly and efficiently optimizing prediction error and AUC of linear
classifiers. CoRR, abs/1802.02535, 2018.

[9] Michael Natole, Yiming Ying, and Siwei Lyu. Stochastic AUC optimization algorithms with linear convergence.
Frontiers in Applied Mathematics and Statistics, 5:30, 2019.

[10] Shivani Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In
COLT, 2013.

[11] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of
the American Statistical Association, 101(473):138-156, 2006.





